Thread: WIP: Faster Expression Processing and Tuple Deforming (including JIT)
Hi Everyone,

TL;DR: Making things faster. Architectural evaluation.

As some of you might be aware, I've been working on making execution of larger queries in postgres faster. While working on "batched execution" I came to the conclusion that, while necessary, it currently isn't showing a large benefit because expression evaluation and tuple deforming are massive bottlenecks. I'm posting a quite massive series of WIP patches here, to get some feedback.

Tuple deforming is slow for two reasons:

1) It's the first thing that accesses tuples, i.e. it'll often incur cache misses. That's partially fundamental, but also partially can be addressed, e.g. through changing the access order in heap as in [1].

2) Tuple deforming has a lot of unpredictable branches, because it has to cope with various types of fields. We e.g. perform alignment in a lot of unneeded cases, do null checks for NOT NULL columns et al.

I tried to address 2) by changing the C implementation. That brings some measurable speedups, but it's not huge. A bigger speedup is making slot_getattr, slot_getsomeattrs, slot_getallattrs very trivial wrappers; but it's still not huge. Finally I turned to just-in-time (JIT) compiling the code for tuple deforming. That doesn't save the cost of 1), but it gets rid of most of 2) (from ~15% to ~3% in TPCH-Q01). The first part is done in 0008, the JITing in 0012.

Expression evaluation and projection is another major bottleneck:

1) Our recursive expression evaluation puts a *lot* of pressure on the stack.

2) There are a lot of indirect function calls when recursing to other expression nodes. These are hard to predict, because the same node type (say ExecEvalAnd()) is used in different parts of an expression tree, and invokes different sub-nodes.

3) The function calls to operators and other functions are hard to predict, leading to a significant number of pipeline stalls.

4) There's a fair amount of pg_list.h style list iteration going on; those are cache and pipeline inefficient.

After some experimenting I came to the conclusion that the recursive processing is a fundamental impediment to making this faster. I've converted (0006) expression processing and projection into an opcode dispatch based interpreter. That yields, especially for complex expressions and larger projections, a significant speedup in itself. But similarly to the deforming, expression evaluation remains a bottleneck after that, primarily because there are still a lot of unpredictable jumps and calls, and because loads/stores have to be complex (e.g. ExprContext->ecxt_innertuple->tts_values[i]/tts_isnull[i] for a single scalar var evaluation).

Using the opcode based representation of expression evaluation (as it's nearly linear, and has done a lot of the lookups ahead of time), it's actually quite easy to JIT compile. *After JITing, expression evaluation itself is more than ten times faster than before*. But unfortunately that doesn't mean that queries are ten times faster - usually we'll hit bottlenecks elsewhere relatively soon. WRT expression evaluation, the biggest cost afterwards is the relatively high overhead of V1 function calls - register based parameter passing is a lot faster.

After experimenting a bit with doing JITing manually (a lot of eye-stabbing kind of fun), I chose to use LLVM.

An overview of the patch-queue so far:

0001 Make get_last_attnums more generic.

Boring prerequisite.

0002 More efficient AggState->pertrans iteration.

Relatively boring minor optimization, but it turns out to be an easily hit bottleneck.
Will commit independently.

0003 Avoid materializing SRFs in the FROM list.
0004 Allow ROWS FROM to return functions as single record column.
0005 Basic implementation of targetlist SRFs via ROWS FROM.
0006 Remove unused code related to targetlist SRFs.

These are basically just prerequisites for the faster expression evaluation, and discussed elsewhere [2]. This implementation is *NOT* going to survive, because we ended up coming to the conclusion that using a separate executor node to expand SRFs is a better plan. But the new expression evaluation code won't be able to handle SRFs...

0007 WIP: Optimize slot_deform_tuple() significantly.

This a) turns tuple deforming into an opcode based dispatch loop (using computed goto on gcc/clang), and b) moves a lot of the logic from slot_deform_tuple() callsites into the function itself - that turns out to be more efficient. I'm not entirely sure it's worth doing the opcode based dispatch part if we're going to also do the JIT bit - it's a fair amount of code, and the speed difference only matters on large amounts of rows.

0008 WIP: Faster expression processing and targetlist projection.

This, functionally nearly complete, patch turns expression evaluation (and tuple deforming as a special case of that) into a "mini language" which is interpreted using either a while(true) switch(opcode) loop or computed goto to jump from opcode to opcode. It does so by moving a lot more of the code for expression evaluation to initialization time, and building a linear series of steps to evaluate expressions, thereby removing all recursion from expression processing. This nearly entirely gets rid of the stack usage cost of expression evaluation (we pretty much never recurse except for subplans). Being able to remove now-redundant calls to check_stack_depth() is a noticeable benefit; it turns out that that check has a noticeable performance impact (as it apparently forces actual use of the stack, instead of just renaming registers inside the CPU).

The new representation and evaluation is functionally nearly complete (there's a single regression test failure, and I know why that is), but the code needs a fair amount of polishing. I do absolutely think that the fundamentals of this are the right way to go, and I'm going to work hard on polishing the patch up. But this isn't something that we can easily do in parts, and it's a huge ass patch. So I'd like to have at least some more buyin before wasting even more time on this.

0009 WIP: Add minimal keytest implementation.

More or less experimental patch that tries to compile simple expressions of the form OpExpr(ScalarVar, Const) into a single expression evaluation step. The benefits probably aren't big enough if we do end up doing JITing of expressions.

0010 WIP: Add configure infrastructure to enable LLVM.
0011 WIP: Beginning of a LLVM JIT infrastructure.

Very boring preliminary patches to add --with-llvm and some minimal infrastructure to handle LLVM. If we go this way, JITed stuff needs to be tied to resource owners, and we need some other centralized infrastructure.

0012 Heavily-WIP: JITing of tuple deforming.

This, in a not-yet-that-nice manner, implements a JITed version of the per-column stuff that slot_deform_tuple() does. It currently always deforms all columns, which obviously would have to change. There are also considerable additional performance improvements possible.
With this patch the per-column overhead (minus bitmap handling, which 0007 moved into a separate loop) drops from 10%+ into the low single digits for a number of queries. Afterwards the biggest cost is VARSIZE_ANY() for varlena columns (which atm isn't inlined). That is, besides the initial cache-miss when accessing tuple->t_hoff, which JITing can do nothing about :(

This can be enabled/disabled using the new jit_tuple_deforming GUC. To make this production ready in some form, we'd have to come up with a way to determine when it's worth doing JITing. The easiest way would be to do so after N slot_deform_tuple() calls or such; another way would be to do it based on cost estimates.

0013 WIP: ExprEval: Make threaded dispatch use a separate field.

Boring preliminary patch. Increases memory usage a bit, needs to be thought through more.

0014 Heavily-WIP: JITed expression evaluation.

This is the most interesting bit performance wise. A few common types of expressions are JITed: scalar value accesses, function calls, boolean expressions, aggregate references. This can be enabled using the new jit_expressions GUC. Even for the supported expression types I've taken some shortcuts (e.g. strict functions aren't actually strict).

The performance benefits are quite noticeable. For TPCH, ExecEvalExpr() (which is where 0008 moved all of expression evaluation/projection) goes from being the top profile entry to barely noticeable, with the JITed function usually not showing up in the top five entries anymore.

After the patch it becomes very clear that our function call infrastructure is a serious bottleneck. Passing all the arguments via memory, and, even worse, forcing isnull/values to be on separate cachelines, has significant performance implications. It also becomes quite noticeable that nodeAgg's transition function invocation doesn't go through ExecEvalExpr() but does that itself - which leads to constant mispredictions if several transition values exist.

While the JIT code is relatively verbose, it turns out to not actually be that hard to write after some startup pains. All the JITing of expressions that exists so far was basically written in ~10 hours.

This also needs some heuristics about when JITing is appropriate. Compiling an expression that's only executed once is never going to be faster than doing the interpretation (it at least needs a writable allocation for the code, and then a remap to make that code read-only and executable). A trace based approach (everything executed at least a thousand times) or a cost based one (all queries costing more than 100000 should be JITed) could make sense. It's worthwhile to note that at the moment this is a per-query-execution JIT, not something that can trivially be cached for prepared statements. That'll need further infrastructure.

0015 Super-Heavily-WIP: LLVM perf integration.

This very very very preliminary patch (including some copy-pasted GPL code!) creates /tmp/perf-<pid>.map files, which allow perf to show useful symbols for profile hits in JITed expressions. I plan to push this towards LLVM, so this isn't something PG will have to do, but it's helpful for evaluation.

I eventually plan to start separate threads about some of the parts in here, but I think the overall picture needs some discussion first.

Q: Why LLVM and not a hand-rolled JIT?
A: Because hand-rolling a JIT is probably hard to scale to multiple maintainers, and multiple platforms.
I started down the path of doing a hand-rolled x86 JIT, and that'd also be doable (faster compilation, slower execution basically); but I doubt we'd end up having that on different architectures and platforms. Not to speak of things like proper debugger and profiler integration. I'm not entirely convinced that that's the right path. It might also be a transitional step, towards doing our completely own JIT. But I think it's a sensible step.

Q: Why LLVM and not $jit-toolkit?
A: Because all the other JIT stuff I looked at was either really unportable (mostly x86 linux only), inconveniently licensed (like e.g. gcc's jit library) or nearly unmaintained (luajit's stuff for example). I might have missed something, but ISTM that atm the choice is between hand-rolling and using LLVM.

Q: Does this actually inline functions from the backend?
A: No. That probably is something desirable in the future, but to me that seems like it should be a separate step. The current one's big enough. It also further increases compilation times, so quite possibly we only want to do so based on another set of heuristics.

Q: ?

Comments? Questions?

Regards,

Andres

[1] https://archives.postgresql.org/message-id/20161030073655.rfa6nvbyk4w2kkpk%40alap3.anarazel.de
[2] https://www.postgresql.org/message-id/20160523005327.v2tr7obytitxcnna@alap3.anarazel.de
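To make the opcode dispatch scheme described above a bit more concrete, here is a minimal sketch of the computed-goto pattern (gcc/clang's labels-as-values extension). The opcode names and step layout are purely illustrative - they are not the actual definitions from 0008:

    /* Illustrative sketch only - not the patch's actual opcodes/structs. */
    typedef enum StepOpcode { OP_CONST, OP_INNER_VAR, OP_DONE } StepOpcode;

    typedef struct ExprStep
    {
        StepOpcode  opcode;     /* selects the jump table entry */
        int         attnum;     /* used by OP_INNER_VAR */
        long        constval;   /* used by OP_CONST */
    } ExprStep;

    /* 'steps' must be terminated by an OP_DONE step */
    static long
    eval_steps(ExprStep *steps, long *inner_values)
    {
        /* one label address per opcode, indexed by StepOpcode */
        static void *dispatch[] = { &&do_const, &&do_inner_var, &&do_done };
        ExprStep   *op = steps;
        long        result = 0;

    #define DISPATCH() goto *dispatch[(++op)->opcode]

        goto *dispatch[op->opcode];

    do_const:
        result = op->constval;
        DISPATCH();
    do_inner_var:
        result = inner_values[op->attnum];      /* null handling elided */
        DISPATCH();
    do_done:
        return result;
    }

Because the step array is linear and each handler jumps directly to the next step's handler, the CPU sees one indirect branch per opcode location (each with its own prediction history) instead of a single hard-to-predict switch branch.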
On 2016-12-05 19:49:55 -0800, Andres Freund wrote:
> I'm posting a quite massive series of WIP patches here, to get some
> feedback.

And here's the patches themselves - let's hope they're not too big (after gzip'ing that is).

Andres
Attachment
- 0001-Make-get_last_attnums-more-generic.patch.gz
- 0002-More-efficient-AggState-pertrans-iteration.patch.gz
- 0003-Avoid-materializing-SRFs-in-the-FROM-list.patch.gz
- 0004-Allow-ROWS-FROM-to-return-functions-as-single-record.patch.gz
- 0005-Basic-implementation-of-targetlist-SRFs-via-ROWS-FRO.patch.gz
- 0006-Remove-unused-code-related-to-targetlist-SRFs.patch.gz
- 0007-WIP-Optimize-slot_deform_tuple-significantly.patch.gz
- 0008-WIP-Faster-expression-processing-and-targetlist-proj.patch.gz
- 0009-WIP-Add-minimal-keytest-implementation.patch.gz
- 0010-WIP-Add-configure-infrastructure-to-enable-LLVM.patch.gz
- 0011-WIP-Beginning-of-a-LLVM-JIT-infrastructure.patch.gz
- 0012-Heavily-WIP-JITing-of-tuple-deforming.patch.gz
- 0013-WIP-ExprEval-Make-threaded-dispatch-use-a-separate-f.patch.gz
- 0014-Heavily-WIP-JITed-expression-evaluation.patch.gz
- 0015-Heavily-WIP-LLVM-perf-integration.patch.gz
Andres Freund <andres@anarazel.de> writes:
> I'm posting a quite massive series of WIP patches here, to get some
> feedback.

I guess the $64 question that has to be addressed here is whether we're prepared to accept LLVM as a run-time dependency. There are some reasons why we might not be:

* The sheer mass of the dependency. What's the installed footprint of LLVM, versus a Postgres server? How hard is it to install from source?

* How will we answer people who say they can't accept having a compiler installed on their production boxes for security reasons?

* Are there any currently-interesting platforms that LLVM doesn't work for? (I'm worried about RISC-V as much as legacy systems.)

I concur with your feeling that hand-rolled JIT is right out. But I'm not sure that whatever performance gain we might get in this direction is worth the costs.

regards, tom lane
On Tue, Dec 6, 2016 at 1:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I guess the $64 question that has to be addressed here is whether we're
> prepared to accept LLVM as a run-time dependency. There are some reasons
> why we might not be:
>
> * The sheer mass of the dependency. What's the installed footprint of
> LLVM, versus a Postgres server? How hard is it to install from source?
>
> * How will we answer people who say they can't accept having a compiler
> installed on their production boxes for security reasons?
>
> * Are there any currently-interesting platforms that LLVM doesn't work
> for? (I'm worried about RISC-V as much as legacy systems.)

I think anything that requires LLVM -- or, for that matter, anything that does JIT by any means -- has got to be optional. But I don't think --with-llvm as a compile option is inherently problematic.

Also, I think this is probably a direction we need to go. I've heard at least one and maybe several PGCon presentations about people JITing tuple deformation and getting big speedups, and I'd like to finally hear one from somebody who intends to integrate that into PostgreSQL.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2016-12-06 13:56:28 -0500, Tom Lane wrote:
> I guess the $64 question that has to be addressed here is whether we're
> prepared to accept LLVM as a run-time dependency. There are some reasons
> why we might not be:

Indeed. It'd only be a soft dependency obviously.

> * The sheer mass of the dependency. What's the installed footprint of
> LLVM, versus a Postgres server? How hard is it to install from source?

Worked for me first try, but I'm perhaps not the best person to judge. It does take a while to compile though (~20min on my laptop).

> * How will we answer people who say they can't accept having a compiler
> installed on their production boxes for security reasons?

I think they'll just not enable --with-llvm in that case, and get inferior performance. Note that installing llvm does not imply installing a full blown C compiler (although I think that's largely moot, you can get pretty much the same things done with just compiling LLVM IR).

> * Are there any currently-interesting platforms that LLVM doesn't work
> for? (I'm worried about RISC-V as much as legacy systems.)

LLVM itself I don't think is a problem, it seems to target a wide range of platforms. The platforms that don't support JIT compiling might be a bit larger, since that involves more than just generating code.

> I concur with your feeling that hand-rolled JIT is right out. But
> I'm not sure that whatever performance gain we might get in this
> direction is worth the costs.

Well, I'm not impartial, but I don't think we do our users a service by leaving significant speedups untackled, and after spending a *LOT* of time on this, I don't see much other choice than JITing. Note that nearly everything performance sensitive is moving towards doing JITing in some form or another.

Greetings,

Andres Freund
On 2016-12-06 14:04:09 -0500, Robert Haas wrote: > I've heard at least one and maybe several PGCon presentations about > people JITing tuple deformation and getting big speedups, and I'd like > to finally hear one from somebody who intends to integrate that into > PostgreSQL. I certainly want to.
On 2016-12-06 11:10:59 -0800, Andres Freund wrote:
> > * Are there any currently-interesting platforms that LLVM doesn't work
> > for? (I'm worried about RISC-V as much as legacy systems.)
>
> LLVM itself I don't think is a problem, it seems to target a wide range
> of platforms. The platforms that don't support JIT compiling might be a
> bit larger, since that involves more than just generating code.

The OS specific part is handling the executable format. The JIT we'd be using (MCJIT) has support for ELF, Mach-O, and COFF. The architecture specific bits seem to be there for x86, arm (little and big endian), aarch64 (arm 64 bits, be/le again), mips, ppc64. Somebody is working on RISC-V support for llvm (i.e. it appears to be working, but is not merged) - but given it's not integrated into gcc either, I'm not seeing that being an argument.

Greetings,

Andres Freund
On Tue, Dec 6, 2016 at 2:10 PM, Andres Freund <andres@anarazel.de> wrote: >> * The sheer mass of the dependency. What's the installed footprint of >> LLVM, versus a Postgres server? How hard is it to install from source? > > Worked for me first try, but I'm perhaps not the best person to judge. > It does take a while to compile though (~20min on my laptop). Presumably this is going to need to be something that a user can get via yum install <blah> or apt-get install <blah> on common systems. I wonder how feasible it would be to make this a run-time dependency rather than a compile option. That's probably overcomplicating things, but... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-12-06 15:13:21 -0500, Robert Haas wrote:
> Presumably this is going to need to be something that a user can get
> via yum install <blah> or apt-get install <blah> on common systems.

Right. apt-get install llvm-dev (or llvm-3.9-dev or such if you want to install a specific version) does the trick here. It's a bit easier to develop with a hand compiled version, because then LLVM adds boatloads of asserts to its IR builder, which catches a fair amount of mistakes. Nothing you'd run in production though (just like you don't use a cassert build...).

> I wonder how feasible it would be to make this a run-time dependency
> rather than a compile option. That's probably overcomplicating
> things, but...

I don't think that's feasible at all unfortunately - the compiler IR (which then is JITed by LLVM) is generated via another C API. We could rebuild that one, but that'd be a lot of work.

Andres
On Tue, Dec 06, 2016 at 01:56:28PM -0500, Tom Lane wrote:
> * The sheer mass of the dependency. What's the installed footprint of
> LLVM, versus a Postgres server? How hard is it to install from source?

As long as it's optional, does this matter? A bigger concern might be interface stability. IIRC the LLVM C/C++ interfaces are not very stable, but bitcode is.

> * How will we answer people who say they can't accept having a compiler
> installed on their production boxes for security reasons?

You don't need the front-ends (e.g., clang) installed in order to JIT.

> * Are there any currently-interesting platforms that LLVM doesn't work
> for? (I'm worried about RISC-V as much as legacy systems.)

The *BSDs support more platforms than LLVM does, that's for sure. (NetBSD supports four more, IIRC, including ia64.) But the patches make LLVM optional anyways, so this should be a non-issue.

> I concur with your feeling that hand-rolled JIT is right out.

Yeah, that way lies maintenance madness.

> But I'm not sure that whatever performance gain we might get in this
> direction is worth the costs.

Byte-/bit-coding query plans then JITting them is very likely to improve performance significantly. Whether you want the maintenance overhead is another story. Sometimes byte-coding + interpretation yields a significant improvement by reducing pressure on the icache and the size of the program to be interpreted. Having the option to JIT or not JIT might be useful.

Nico
Andres Freund <andres@anarazel.de> writes:
> On 2016-12-06 13:56:28 -0500, Tom Lane wrote:
> > I guess the $64 question that has to be addressed here is whether we're
> > prepared to accept LLVM as a run-time dependency. There are some reasons
> > why we might not be:
>
> Indeed. It'd only be a soft dependency obviously.

Oh, so we'd need to maintain both the LLVM and the traditional expression execution code? That seems like a bit of a pain, but maybe we can live with it.

> > * How will we answer people who say they can't accept having a compiler
> > installed on their production boxes for security reasons?
>
> I think they'll just not enable --with-llvm in that case, and get
> inferior performance. Note that installing llvm does not imply
> installing a full blown C compiler (although I think that's largely
> moot, you can get pretty much the same things done with just compiling
> LLVM IR).

I'm not entirely thrilled with the idea of this being a configure-time decision, because that forces packagers to decide for their entire audience whether it's okay to depend on LLVM. That would be an untenable position to put e.g. Red Hat's packagers in: either they screw the people who want performance or they screw the people who want security.

I think it'd be all right if we can build this so that the direct dependency on LLVM is confined to a separately-packageable extension. That way, a packager can produce a core postgresql-server package that does not require LLVM, plus a postgresql-llvm package that does, and the "no compiler please" crowd simply doesn't install the latter package.

The alternative would be to produce two independent builds of the server, which I suppose might be acceptable but it sure seems like a kluge, or at least something that simply wouldn't get done by most vendors.

regards, tom lane
Hi,

On 2016-12-06 14:19:21 -0600, Nico Williams wrote:
> A bigger concern might be interface stability. IIRC the LLVM C/C++
> interfaces are not very stable, but bitcode is.

The C API is a lot more stable than the C++ bit, that's the primary reason I ended up using it, despite the C++ docs being better.

> > I concur with your feeling that hand-rolled JIT is right out. But
>
> Yeah, that way lies maintenance madness.

I'm not quite that sure about that. I had a lot of fun doing some hand-rolled x86 JITing. Not that that's a ward against me being mad. But more seriously: manually doing a JIT gives you a lot faster compilation times, which makes JIT applicable in a lot more situations.

> > I'm not sure that whatever performance gain we might get in this
> > direction is worth the costs.
>
> Byte-/bit-coding query plans then JITting them is very likely to improve
> performance significantly.

Note that what I'm proposing is a far cry away from that - this converts two (performance wise two, size wise one) significant subsystems, but far from all the executor, to be JITable. I think there's some more low hanging fruit (particularly aggregate transition functions), but converting everything seems to hit the wrong spot in the benefit/effort/maintainability triangle.

- Andres
On Tue, Dec 06, 2016 at 12:27:51PM -0800, Andres Freund wrote: > On 2016-12-06 14:19:21 -0600, Nico Williams wrote: > > A bigger concern might be interface stability. IIRC the LLVM C/C++ > > interfaces are not very stable, but bitcode is. > > The C API is a lot more stable than the C++ bit, that's the primary > reason I ended up using it, despite the C++ docs being better. Ah. > > > I concur with your feeling that hand-rolled JIT is right out. But > > > > Yeah, that way lies maintenance madness. > > I'm not quite that sure about that. I had a lot of fun doing some > hand-rolled x86 JITing. Not that is a ward against me being mad. But > more seriously: Manually doing a JIT gives you a lot faster compilation > times, which makes JIT applicable in a lot more situations. What I meant is that each time there are new ISA extensions, or differences in how relevant/significant different implementations of the same ISA implement certain instructions, and/or every time you want to add a new architecture... someone has to do a lot of very low-level work. > > > I'm not sure that whatever performance gain we might get in this > > > direction is worth the costs. > > > > Byte-/bit-coding query plans then JITting them is very likely to improve > > performance significantly. > > Note that what I'm proposing is a far cry away from that - this converts > two (peformance wise two, size wise one) significant subsystems, but far > from all the executors to be JIT able. I think there's some more low Yes, I know. > hanging fruits (particularly aggregate transition functions), but > converting everything seems to hit the wrong spot in the > benefit/effort/maintainability triangle. Maybe? At least with the infrastructure in place for it someone might try it and see. Nico --
On 2016-12-06 15:25:44 -0500, Tom Lane wrote:
> Oh, so we'd need to maintain both the LLVM and the traditional expression
> execution code? That seems like a bit of a pain, but maybe we can live
> with it.

Yea, that's why I converted the "traditional" expression evaluation into a different format first - that way the duplication is a lot lower. E.g. scalar var eval looks like

    EEO_CASE(EEO_INNER_VAR):
    {
        int         attnum = op->d.var.attnum;

        Assert(op->d.var.attnum >= 0);
        *op->resnull = innerslot->tts_isnull[attnum];
        *op->resvalue = innerslot->tts_values[attnum];
        EEO_DISPATCH(op);
    }

in normal evaluation, and like

    case EEO_INNER_VAR:
    {
        LLVMValueRef value,
                    isnull;
        LLVMValueRef v_attnum;

        v_attnum = LLVMConstInt(LLVMInt32Type(), op->d.var.attnum, false);
        value = LLVMBuildLoad(builder,
                              LLVMBuildGEP(builder, v_innervalues,
                                           &v_attnum, 1, ""), "");
        isnull = LLVMBuildLoad(builder,
                               LLVMBuildGEP(builder, v_innernulls,
                                            &v_attnum, 1, ""), "");
        LLVMBuildStore(builder, value, v_resvaluep);
        LLVMBuildStore(builder, isnull, v_resnullp);
        LLVMBuildBr(builder, opblocks[i + 1]);
        break;
    }

for JITed evaluation.

> I'm not entirely thrilled with the idea of this being a configure-time
> decision, because that forces packagers to decide for their entire
> audience whether it's okay to depend on LLVM. That would be an untenable
> position to put e.g. Red Hat's packagers in: either they screw the people
> who want performance or they screw the people who want security.

Hm. I've a bit of a hard time buying the security argument here. Having LLVM (not clang!) installed doesn't really change the picture that much. In either case you can install binaries, and you're very likely already using some program that does JIT internally. And postgres itself gives you plenty of ways to execute arbitrary code as superuser. The argument for not installing a C compiler seems to be that it makes it less convenient to build an executable. I doubt that having a C(++) library for code generation is convenient enough to change the picture there.

> I think it'd be all right if we can build this so that the direct
> dependency on LLVM is confined to a separately-packageable extension.
> That way, a packager can produce a core postgresql-server package
> that does not require LLVM, plus a postgresql-llvm package that does,
> and the "no compiler please" crowd simply doesn't install the latter
> package.

That should be possible, but I'm not sure it's worth the effort. The JIT infrastructure will need resowner integration and such. We can obviously split things so that part is independent of LLVM, but I'm unconvinced that the benefit is large enough.

> The alternative would be to produce two independent builds of the
> server, which I suppose might be acceptable but it sure seems like
> a kluge, or at least something that simply wouldn't get done by
> most vendors.

Hm. We could make that a make target ourselves ;)

Regards,

Andres
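For readers less familiar with the LLVM C API: the JITed snippet above runs inside a per-expression code generation loop, with the builder positioned in one basic block per step. Roughly this kind of setup can be assumed around it (a sketch; the function name and signature here are invented, not the patch's actual code):

    #include <llvm-c/Core.h>

    static LLVMValueRef
    begin_evalexpr(LLVMModuleRef mod, LLVMBuilderRef builder)
    {
        /* Datum evalexpr(Datum *values, char *nulls) -- invented signature */
        LLVMTypeRef argtypes[2];
        LLVMTypeRef fntype;
        LLVMValueRef fn;
        LLVMBasicBlockRef entry;

        argtypes[0] = LLVMPointerType(LLVMInt64Type(), 0);
        argtypes[1] = LLVMPointerType(LLVMInt8Type(), 0);
        fntype = LLVMFunctionType(LLVMInt64Type(), argtypes, 2, 0);
        fn = LLVMAddFunction(mod, "evalexpr", fntype);

        /* each expression step gets its own block; builder emits into it */
        entry = LLVMAppendBasicBlock(fn, "entry");
        LLVMPositionBuilderAtEnd(builder, entry);
        return fn;
    }

Note how the LLVMBuildBr(builder, opblocks[i + 1]) at the end of the JITed case replaces the interpreter's EEO_DISPATCH(): instead of an indirect jump through a table, each step ends in a direct branch to the next step's block, which LLVM can then optimize across.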
On 2016-12-06 14:35:43 -0600, Nico Williams wrote: > On Tue, Dec 06, 2016 at 12:27:51PM -0800, Andres Freund wrote: > > On 2016-12-06 14:19:21 -0600, Nico Williams wrote: > > > > I concur with your feeling that hand-rolled JIT is right out. But > > > > > > Yeah, that way lies maintenance madness. > > > > I'm not quite that sure about that. I had a lot of fun doing some > > hand-rolled x86 JITing. Not that is a ward against me being mad. But > > more seriously: Manually doing a JIT gives you a lot faster compilation > > times, which makes JIT applicable in a lot more situations. > > What I meant is that each time there are new ISA extensions, or > differences in how relevant/significant different implementations of the > same ISA implement certain instructions, and/or every time you want to > add a new architecture... someone has to do a lot of very low-level > work. Yea, that's why I didn't pursue this path further. I *personally* think it'd be perfectly fine to only support JITing on linux x86_64 and aarch64 for now. And those I'd be willing to work on. But since I know that's not project policy... - Andres
On Tue, Dec 06, 2016 at 12:36:41PM -0800, Andres Freund wrote: > On 2016-12-06 15:25:44 -0500, Tom Lane wrote: > > I'm not entirely thrilled with the idea of this being a configure-time > > decision, because that forces packagers to decide for their entire > > audience whether it's okay to depend on LLVM. That would be an untenable > > position to put e.g. Red Hat's packagers in: either they screw the people > > who want performance or they screw the people who want security. There's no security issue. The dependency is on LLVM libraries, not LLVM front-ends (e.g., clang(1)). I don't think there's a real issue as to distros/packagers/OS vendors. They already have to package LLVM, and they already package LLVM libraries separately from LLVM front-ends. > The argument for not install a c compiler seems to be that it makes it > less convenient to build an executable. I doubt that having a C(++) > library for code generation is convenient enough to change the picture > there. The security argument goes back to the days of the Morris worm, which depended on having developer tools (specifically in that case, ld(1), the link-editor). But JIT via LLVM won't give hackers a way to generate or link arbitrary object code. Nico --
On Mon, Dec 5, 2016 at 7:49 PM, Andres Freund <andres@anarazel.de> wrote:
> I tried to address 2) by changing the C implementation. That brings some
> measurable speedups, but it's not huge. A bigger speedup is making
> slot_getattr, slot_getsomeattrs, slot_getallattrs very trivial wrappers;
> but it's still not huge. Finally I turned to just-in-time (JIT)
> compiling the code for tuple deforming. That doesn't save the cost of
> 1), but it gets rid of most of 2) (from ~15% to ~3% in TPCH-Q01). The
> first part is done in 0008, the JITing in 0012.

A more complete motivating example would be nice. For example, it would be nice to see the overall speedup for some particular TPC-H query.

-- Peter Geoghegan
On 2016-12-06 13:27:14 -0800, Peter Geoghegan wrote:
> A more complete motivating example would be nice. For example, it
> would be nice to see the overall speedup for some particular TPC-H
> query.

Well, it's a bit WIP-y for that - not all TPCH queries run JITed yet, as I've not done that for enough expression types... And you run quickly into other bottlenecks. But here we go for TPCH (scale 10) Q01:

master:
Time: 33885.381 ms

profile:
  16.29%  postgres  postgres  [.] slot_getattr
  12.85%  postgres  postgres  [.] ExecMakeFunctionResultNoSets
  10.85%  postgres  postgres  [.] advance_aggregates
   6.91%  postgres  postgres  [.] slot_deform_tuple
   6.70%  postgres  postgres  [.] advance_transition_function
   4.59%  postgres  postgres  [.] ExecProject
   4.25%  postgres  postgres  [.] float8_accum
   3.69%  postgres  postgres  [.] tuplehash_insert
   2.39%  postgres  postgres  [.] float8pl
   2.20%  postgres  postgres  [.] bpchareq
   2.03%  postgres  postgres  [.] check_stack_depth

(note that all expression evaluated things are distributed among many functions)

dev (no jiting):
Time: 30343.532 ms

profile:
  16.57%  postgres  postgres  [.] slot_deform_tuple
  13.39%  postgres  postgres  [.] ExecEvalExpr
   8.64%  postgres  postgres  [.] advance_aggregates
   8.58%  postgres  postgres  [.] advance_transition_function
   5.83%  postgres  postgres  [.] float8_accum
   5.14%  postgres  postgres  [.] tuplehash_insert
   3.89%  postgres  postgres  [.] float8pl
   3.60%  postgres  postgres  [.] slot_getattr
   2.66%  postgres  postgres  [.] bpchareq
   2.56%  postgres  postgres  [.] heap_getnext

dev (jiting):
SET jit_tuple_deforming = on;
SET jit_expressions = true;

Time: 24439.803 ms

profile:
  11.11%  postgres  postgres        [.] slot_deform_tuple
  10.87%  postgres  postgres        [.] advance_aggregates
   9.74%  postgres  postgres        [.] advance_transition_function
   6.53%  postgres  postgres        [.] float8_accum
   5.25%  postgres  postgres        [.] tuplehash_insert
   4.31%  postgres  perf-10698.map  [.] deform0
   3.68%  postgres  perf-10698.map  [.] evalexpr6
   3.53%  postgres  postgres        [.] slot_getattr
   3.41%  postgres  postgres        [.] float8pl
   2.84%  postgres  postgres        [.] bpchareq

(note how expression eval went from 13.39% to roughly 4%)

The slot_deform_tuple cost here is primarily cache misses. If you do the "memory order" iteration, it drops significantly. The JIT generated code still leaves a lot on the table, i.e. this is definitely not the best we can do. We also deform half the tuple twice, because I've not yet added support for starting to deform in the middle of a tuple.

Independent of new expression evaluation and/or JITing, if you make advance_aggregates and advance_transition_function inline functions (or you do profiling accounting for children), you'll notice that ExecAgg() + advance_aggregates + advance_transition_function themselves take up about 20% cpu-time. That's *not* including the hashtable management, the actual transition functions, and such themselves.
If you have queries where tuple deforming is a bigger proportion of the load, or where expression evaluation (including projection) is a larger part (any NULLs, e.g.), you can get a lot bigger wins, even without actually optimizing the generated code (which I've not yet done).

Just btw: float8_accum really should use an internal aggregation type instead of using a postgres array...

Andres
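For context on that last remark: float8_accum keeps its transition state - row count, running sum, running sum of squares - in a float8[] array datum, which means array header overhead on every transition. A by-reference "internal" state would just be a plain struct, roughly like this (a sketch of the idea, not actual backend code):

    /* hypothetical by-reference transition state for float8_accum */
    typedef struct Float8AccumState
    {
        double  N;      /* number of input rows so far */
        double  sumX;   /* running sum of inputs */
        double  sumX2;  /* running sum of squares of inputs */
    } Float8AccumState;

The transition function would then update three doubles in place instead of constructing or modifying an array datum each call.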
On 7 December 2016 at 04:13, Robert Haas <robertmhaas@gmail.com> wrote: > I wonder how feasible it would be to make this a run-time dependency > rather than a compile option. Or something that's compiled with the server, but produces a separate .so that's the only thing that links to LLVM. So packagers can avoid a dependency on LLVM for postgres. I suspect it wouldn't be worth the complexity, the added indirection necessary, etc. If you're using packages then pulling in LLVM isn't a big deal. If you're not, then don't use --with-llvm . -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 7 December 2016 at 14:39, Craig Ringer <craig@2ndquadrant.com> wrote: > On 7 December 2016 at 04:13, Robert Haas <robertmhaas@gmail.com> wrote: > >> I wonder how feasible it would be to make this a run-time dependency >> rather than a compile option. > > Or something that's compiled with the server, but produces a separate > .so that's the only thing that links to LLVM. So packagers can avoid a > dependency on LLVM for postgres. Ahem, next time I'll finish the thread first. Nevermind. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 6, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> 0009 WIP: Add minimal keytest implementation.
>
> More or less experimental patch that tries to compile simple
> expressions of the form OpExpr(ScalarVar, Const) into a single
> expression evaluation step. The benefits probably aren't big enough
> if we do end up doing JITing of expressions.

Seems like we are trying to achieve the same thing with the 'heap scan key push down' patch [1] as well. But I think with this patch you are covering OpExpr(ScalarVar, Const) for all the cases, wherein with [1] we are currently only doing it for seqscan, and we are trying to make that generic for other nodes as well.

So do you see any advantage of continuing [1]? Is there something extra we can achieve with [1] that we can not get with this patch?

[1] https://www.postgresql.org/message-id/CAFiTN-takT6Z4s3tGDwyC9bhYf%2B1gumpvW5bo_fpeNUy%2BrL-kg%40mail.gmail.com

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Andres,
> dev (no jiting):
> Time: 30343.532 ms
> dev (jiting):
> SET jit_tuple_deforming = on;
> SET jit_expressions = true;
>
> Time: 24439.803 ms
FYI, ~20% improvement for TPCH Q1 is consistent with what we find when we only jit expression.
Cheers,
-cktan
Hi,

On 2016-12-12 18:11:13 -0800, CK Tan wrote:
> > dev (no jiting):
> > Time: 30343.532 ms
>
> > dev (jiting):
> > SET jit_tuple_deforming = on;
> > SET jit_expressions = true;
> >
> > Time: 24439.803 ms
>
> FYI, ~20% improvement for TPCH Q1 is consistent with what we find when we
> only jit expression.

For Q1 I think the bigger win is JITing the transition function invocation in advance_aggregates/transition_function - that's IIRC where the biggest bottleneck lies. If you have any details about your JITing experience that you're willing to talk about ...

Regards,

Andres
On Mon, Dec 12, 2016 at 6:14 PM, Andres Freund <andres@anarazel.de> wrote:
>
>
> For Q1 I think the bigger win is JITing the transition function
> invocation in advance_aggregates/transition_function - that's IIRC where
> the biggest bottleneck lies.
>
Yeah, we bundle the agg core into our expr work... no point otherwise since we do
it for OLAP.
As for experience, I think you have found out for yourself. There is a lot that
can be done and heuristics are involved in many places to decide whether
to jit fully, partially, or not at all. But it looks like you have a solid basis now
to proceed and explore the beyond :-)
Send me private email if you have a particular question.
Regards,
-cktan
On Tue, Dec 6, 2016 at 11:10:59AM -0800, Andres Freund wrote:
> > I concur with your feeling that hand-rolled JIT is right out. But
> > I'm not sure that whatever performance gain we might get in this
> > direction is worth the costs.
>
> Well, I'm not impartial, but I don't think we do our users a service by
> leaving significant speedups untackled, and after spending a *LOT* of
> time on this, I don't see much other choice than JITing. Note that
> nearly everything performance sensitive is moving towards doing JITing
> in some form or another.

Agreed, we don't really have a choice. After all the optimizations we have done to so many subsystems, our executor is relatively slow and is a major drag on our system, specifically for long-running queries. The base problem with the executor is the state machines at so many levels, and we really can't optimize that while keeping a reasonable maintenance burden. This is where JIT and LLVM help.

I outlined two external projects that were researching this in this blog entry:

http://momjian.us/main/blogs/pgblog/2016.html#April_1_2016

I am excited to now be seeing WIP code.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Hi,

Here's an updated version of this patchset (now that SRFs are implemented "properly"). This is a large patchset, with later patches being more experimental than earlier ones.

0001: Add expression dependencies on composite type / whole row components.

This patch later allows removing some execution time checks (in ExecEvalScalarVar / ExecEvalFieldSelect), which'd be inconvenient to do when JITed.

0002: Make get_last_attnums more generic.

What it says.

0003: autoconf test for computed goto support.

The faster expression evaluation implementation uses computed gotos when available. So we need to check whether the current compiler supports those.

0004: WIP: Faster expression processing and targetlist projection.

The most important patch of the series. Changes expression evaluation from being tree-walk based, as done in execQual.c, into being a series of instructions. Then adds an opcode dispatch based interpreter for that mini language. That allows later adding different implementations. Projection is now a special case of expression evaluation.

0005: WIP: Add configure infrastructure to enable LLVM.

0006: WIP: Beginning of a LLVM JIT infrastructure.

Needs work.

0007: WIP: JITed expression evaluation.

This finally performs runtime compilation of expressions.

I think up to 0004 things are getting pretty close to being ready. There are some minor FIXMEs that need to be addressed (primarily around CaseTestExpr/CoerceToDomainValue when invoked outside of plans), and some more comment work.

The biggest remaining piece of work in the whole series is 0006. Right now compiled expressions can be leaked, which is obviously not ok. So I need to write ResourceOwner integration.

0007 covers nearly all types of expressions/queries (function usage isn't supported right now, and there's two missing calls to MakeExpandedObjectReadOnly). There's a lot more that could be done for efficiency (generating smarter LLVM IR basically), but we already see quite significant performance wins.

I have later patches that
- centralize tuple deforming logic into slot_deform_logic from slot_getattr, slot_getsomeattrs, slot_getallattrs and such
- add JITing of tuple deforming
- add JITing of aggregation transitions; this works by converting transition functions into expression opcodes, and then adding JIT support for those new instructions (EEO_AGG_FILTER / STRICT_INPUT_CHECK / INIT_TRANS / STRICT_TRANS_CHECK / PLAIN_TRANS / ORDERED_TRANS_DATUM / ORDERED_TRANS_TUPLE)

but the above 7 patches already are large enough imo.

For now, 0004 is I think the patch that really could use a look by somebody but me - it's quite a radical departure from how things currently look in master, and it's a fairly large patch. I don't really see a good way to break it down into useful parts :( (the projection changes are the only thing that I can see usefully being done separately).

Regards,

Andres
Attachment
- 0001-Add-expression-dependencies-on-composite-type-whole-.patch.gz
- 0002-Make-get_last_attnums-more-generic.patch.gz
- 0003-Add-autoconf-test-for-computed-goto-support.patch.gz
- 0004-WIP-Faster-expression-processing-and-targetlist-proj.patch.gz
- 0005-WIP-Add-configure-infrastructure-to-enable-LLVM.patch.gz
- 0006-WIP-Beginning-of-a-LLVM-JIT-infrastructure.patch.gz
- 0007-WIP-JITed-expression-evaluation.patch.gz
On 12/06/16 21:40, Andres Freund wrote:
> On 2016-12-06 14:35:43 -0600, Nico Williams wrote:
> > What I meant is that each time there are new ISA extensions, or
> > differences in how relevant/significant different implementations of the
> > same ISA implement certain instructions, and/or every time you want to
> > add a new architecture... someone has to do a lot of very low-level
> > work.
>
> Yea, that's why I didn't pursue this path further. I *personally* think
> it'd be perfectly fine to only support JITing on linux x86_64 and
> aarch64 for now. And those I'd be willing to work on. But since I know
> that's not project policy...

Well, if this thread of thought about hand-crafted JIT should be taken up again by someone at some point in time, I guess it could be possible to reuse tools that are already out there, such as "DynASM" ( http://luajit.org/dynasm_features.html ) from the LuaJIT project (currently offers x86-64, x86-32, ARM-32, PPC-32, and MIPS-32). Maybe one could even recycle parts of the LuaJIT project itself ( http://luajit.org/luajit.html , http://article.gmane.org/gmane.comp.lang.lua.general/58908 ).

-- Markus Nullmeier http://www.g-vo.org German Astrophysical Virtual Observatory (GAVO)
Hi,

On 2017-02-10 18:18:13 +0100, Markus Nullmeier wrote:
> Well, if this thread of thought about hand-crafted JIT should be taken
> up again by someone at some point in time, I guess it could be possible
> to reuse tools that are already out there, such as "DynASM"
> ( http://luajit.org/dynasm_features.html ) from the LuaJIT project
> (currently offers x86-64, x86-32, ARM-32, PPC-32, and MIPS-32).
> Maybe one could even recycle parts of the LuaJIT project itself.

FWIW, I'd looked at dynasm/luajit. One big reason to go for LLVM is that it has nearly all the infrastructure to make backend functions/operators inlineable. Especially for some of the arithmetic operations and such, that'd be quite useful performance-wise. With LLVM you can just use clang on C to generate the IR, do some work to boil down the IR modules to the relevant functions (i.e. remove non sql-callable functions), for which LLVM has infrastructure, and then inline the functions that way. That's a lot harder to do with nearly everything else (save gcc's jit library, but the licensing and stability situation makes that unattractive).

Greetings,

Andres Freund
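The inlining flow sketched in that reply - clang emits bitcode for backend functions, which later gets linked into the JIT module - could look roughly like this with the LLVM C API (an assumed sketch; the actual patchset may well use the C++ API or different entry points for this part):

    #include <llvm-c/Core.h>
    #include <llvm-c/IRReader.h>
    #include <llvm-c/Linker.h>

    static int
    link_backend_bitcode(LLVMContextRef ctx, LLVMModuleRef jit_mod,
                         const char *bitcode_path)
    {
        LLVMMemoryBufferRef buf;
        LLVMModuleRef backend_mod;
        char       *err = NULL;

        /* load the clang-emitted .bc file containing backend functions */
        if (LLVMCreateMemoryBufferWithContentsOfFile(bitcode_path, &buf, &err))
            return -1;          /* could not read the file; err leaked here */
        if (LLVMParseIRInContext(ctx, buf, &backend_mod, &err))
            return -1;          /* not valid bitcode/IR */

        /* merge into the JIT module, making callees visible for inlining */
        return LLVMLinkModules2(jit_mod, backend_mod) ? -1 : 0;
    }

Once the definitions live in the same module, LLVM's normal inlining passes can fold e.g. small arithmetic operators directly into the generated expression function.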
Hi,

Attached is an updated version of the patchset, but more importantly some benchmark numbers.

I ran tpch scale 5, on my laptop with
-c shared_buffers=20GB                 # all data fits into memory
-c max_parallel_workers_per_gather=0   # reduce variability
-c work_mem=4GB                        # other suggestions welcome
-c huge_pages=on                       # reduces variability

My benchmarking script first prewarms the whole database, then runs the tpch queries in sequence, repeated three times, and compares the shortest execution time:

master q01 min: 16176.843  dev-llvm-off min: 13153.179  [diff -22.99]
master q02 min:  1280.691  dev-llvm-off min:  1263.032  [diff  -1.40]
master q03 min:  7330.737  dev-llvm-off min:  7144.982  [diff  -2.60]
master q04 min:  1014.347  dev-llvm-off min:   991.008  [diff  -2.36]
master q05 min:  5490.103  dev-llvm-off min:  5439.878  [diff  -0.92]
master q06 min:  1916.45   dev-llvm-off min:  1818.839  [diff  -5.37]
master q07 min:  5282.129  dev-llvm-off min:  5222.879  [diff  -1.13]
master q08 min:  1655.824  dev-llvm-off min:  1532.1    [diff  -8.08]
master q09 min:  7009.372  dev-llvm-off min:  6724.515  [diff  -4.24]
master q10 min:  6017.01   dev-llvm-off min:  5848.337  [diff  -2.88]
master q11 min:   316.724  dev-llvm-off min:   292.51   [diff  -8.28]
master q12 min:  4829.502  dev-llvm-off min:  4698.14   [diff  -2.80]
master q13 min:  8679.991  dev-llvm-off min:  8427.614  [diff  -2.99]
master q14 min:   814.109  dev-llvm-off min:   774.805  [diff  -5.07]
master q15 min:  1957.248  dev-llvm-off min:  1841.377  [diff  -6.29]
master q16 min:  1976.544  dev-llvm-off min:  1936.932  [diff  -2.05]
master q17 min:   559.664  dev-llvm-off min:   446.199  [diff -25.43]
master q18 min: 16565.722  dev-llvm-off min: 15475.372  [diff  -7.05]
master q19 min:   385.222  dev-llvm-off min:   310.398  [diff -24.11]
master q20 min:  1717.015  dev-llvm-off min:  1389.064  [diff -23.61]
master q22 min:   753.017  dev-llvm-off min:   637.152  [diff -18.18]

(note that there's a fair amount of per-run variance, I've seen one or two of the small differences go the other way in previous runs)

It's clearly visible that the performance gain heavily depends on the type of query. Which makes sense - expression evaluation isn't a bottleneck everywhere. The actual gains in expression evaluation are larger than the maximum here, but bottlenecks shift.

Besides cleanups, the important changes in this version of the patches are:

- I've added a fastpath for the interpretation of very simple expressions (non-sysvar, single column Vars and Consts); before that, performance regressed slightly for cases that evaluated a *lot* of Vars/Consts. The only case where I could actually reproduce that is in large hash-joins where the to-be-hashed value is extracted on its own.

- I moved the invocation of expression evaluation back to a callback. That's better for predicting branches both with JITing and when using fastpath functions.

- removed unused EXEC_EVALDEBUG code

- Andres
Attachment
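The fastpath mentioned in that mail amounts to recognizing trivial step programs at initialization time and installing a specialized callback instead of entering the dispatch loop at all. A hypothetical sketch of the idea (the struct and function names here are invented):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uintptr_t Datum;        /* stand-in for PostgreSQL's Datum */

    typedef struct ExprStateSketch ExprStateSketch;
    typedef Datum (*ExprEvalFunc) (ExprStateSketch *state, bool *isnull);

    struct ExprStateSketch
    {
        ExprEvalFunc evalfunc;      /* invoked for every evaluation */
        int          attnum;        /* for the single-Var fastpath */
        Datum       *inner_values;
        bool        *inner_isnull;
    };

    /* fastpath: the whole "program" is a single inner Var fetch */
    static Datum
    exec_just_inner_var(ExprStateSketch *state, bool *isnull)
    {
        *isnull = state->inner_isnull[state->attnum];
        return state->inner_values[state->attnum];
    }

At initialization time, a program consisting of one Var step would set state->evalfunc = exec_just_inner_var, so per-row evaluation is one call plus two array loads - which also matches the mail's remark that moving invocation back to a callback is friendlier to branch prediction.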
On Mon, Mar 6, 2017 at 10:01 PM, Andres Freund <andres@anarazel.de> wrote: > My benchmarking script first prewarms the whole database, then runs the > tpch queries in sequence, repeated three times, and compares the shortes > execution time: Those numbers are stupendous. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I looked over 0001-Add-expression-dependencies-on-composite-type-whole-.patch. That seems useful independent of the things you have following.

Even though it is presented more as a preparatory change, it appears to have noticeable user-facing impact, which we should analyze. For example, in the regression test, the command

    alter table tt14t drop column f3;

is now rejected, rightly, but the command

    alter table tt14t drop column f3 cascade;

ends up dropping the whole view, which might be a bit surprising. We don't support dropping columns from views, so there might be no alternative, but I wonder what else is changing because of this. I think that might deserve a bit more analysis and perhaps some test cases.

    --- a/src/backend/catalog/dependency.c
    +++ b/src/backend/catalog/dependency.c
    @@ -206,6 +206,9 @@ static bool object_address_present_add_flags(const ObjectAddress *object,
     static bool stack_address_present_add_flags(const ObjectAddress *object,
                                                 int flags,
                                                 ObjectAddressStack *stack);
    +static void add_type_addresses(Oid typeid, bool include_components, ObjectAddresses *addrs);
    +static void add_type_component_addresses(Oid typeid, ObjectAddresses *addrs);
    +static void add_class_component_addresses(Oid relid, ObjectAddresses *addrs);
     static void DeleteInitPrivs(const ObjectAddress *object);

For some reason, the function add_object_address() is singular, and these new functions are plural. Clearly, they take a plural argument, so your version seems more correct, but I wonder if we should keep that consistent.

    + * e.g. not to the right thing for column-less tables. Note that

Small typo there. (to -> do)

    @@ -1639,6 +1641,9 @@ find_expr_references_walker(Node *node,
                add_object_address(OCLASS_PROC, funcexpr->funcid, 0,
                                   context->addrs);
    +           /* dependency on type itself already exists via function */
    +           add_type_component_addresses(funcexpr->funcresulttype,
    +                                        context->addrs);
    +
                /* fall through to examine arguments */
            }
            else if (IsA(node, OpExpr))

Why shouldn't the function itself also depend on the components of its return type? How are argument types handled?

-- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2017-03-07 18:46:31 -0500, Peter Eisentraut wrote:
> I looked over
> 0001-Add-expression-dependencies-on-composite-type-whole-.patch. That
> seems useful independent of the things you have following.
>
> Even though it is presented more as a preparatory change, it appears to
> have noticeable user-facing impact, which we should analyze.

Indeed.

> For example, in the regression test, the command
>
>     alter table tt14t drop column f3;
>
> is now rejected, rightly, but the command
>
>     alter table tt14t drop column f3 cascade;
>
> ends up dropping the whole view, which might be a bit surprising. We
> don't support dropping columns from views, so there might be no
> alternative

It seems strictly better than silently breaking the view.

> , but I wonder what else is changing because of this. I
> think that might deserve a bit more analysis and perhaps some test cases.

There are already some tests. More is probably good - if you have specific cases in mind...

> For some reason, the function add_object_address() is singular, and
> these new functions are plural. Clearly, they take a plural argument, so
> your version seems more correct, but I wonder if we should keep that
> consistent.

I named them plural, because add_object_address only adds a single new entry, but add_type_addresses etc. potentially add multiple ones. So I think plural/singular as used here is actually consistent?

> Small typo there. (to -> do)

Oops.

> Why shouldn't the function itself also depend on the components of its
> return type?

Because that'd make it impossible to change the return type components - if the return type is baked into the view that's necessary, but for a "freestanding function" it's not. If you e.g. have a function that just returns a table's rows, it'd certainly be annoying if that'd prevent you from dropping columns.

> How are argument types handled?

We fall through to

    return expression_tree_walker(node, find_expr_references_walker,
                                  (void *) context);

which'll add a dependency if necessary. If it's e.g. a table column, function return type or such we'll add a dependency there.

Greetings,

Andres Freund
On 3/7/17 19:14, Andres Freund wrote: >> Why shouldn't the function itself also depend on the components of its >> return type? > Because that'd make it impossible to change the return type components - > if the return type is baked into the view that's necessary, but for a > "freestanding function" it's not. If you e.g. have a function that just > returns a table's rows, it'd certainly be annoying if that'd prevent you > from dropping columns. I think functions breaking when the return type is fiddled with is actually a not-uncommon problem in practice. It would be nice if that could be addressed. Not necessarily in this patch, but I would like to keep the options open. Comments from others would be welcome on this. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I have also looked at the 0002 and 0003 patches, and they seem OK, but they are obviously not of much use without 0004. What is your ambition for getting 0004 reviewed and committed for PG10? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-03-10 08:51:25 -0500, Peter Eisentraut wrote:
> On 3/7/17 19:14, Andres Freund wrote:
>>> Why shouldn't the function itself also depend on the components of its
>>> return type?
>> Because that'd make it impossible to change the return type components -
>> if the return type is baked into the view that's necessary, but for a
>> "freestanding function" it's not. If you e.g. have a function that just
>> returns a table's rows, it'd certainly be annoying if that'd prevent you
>> from dropping columns.
>
> I think functions breaking when the return type is fiddled with is
> actually a not-uncommon problem in practice. It would be nice if that
> could be addressed.

What problem are you thinking of exactly, and what'd be the solution?
I'd guess that people wouldn't like being unable to change columns in a
table if some function returned the type.

One rather extreme way would be to have functions return a "different"
type, that's initially identical to the table/composite type. If the
composite type then changes, we'd transparently map between the actual
return type, and the one callers expect. But that'd be a fair bit of
effort, and presumably also quite confusing for users that might e.g.
expect a new column to show up.

> Not necessarily in this patch, but I would like to keep the options
> open.

Yea, seems worth thinking about. I don't think the patch closes down
options?

Greetings,

Andres Freund
On 2017-03-10 09:00:14 -0500, Peter Eisentraut wrote:
> I have also looked at the 0002 and 0003 patches, and they seem OK, but
> they are obviously not of much use without 0004. What is your ambition
> for getting 0004 reviewed and committed for PG10?

I'd really like to get it in. The performance improvements on its own
are significant, and it provides the basis for significantly larger
improvements again (JIT) - those followup improvements are large
patches again though, so I'd rather not do all of that next cycle.

My next step (over the weekend) is to add tests to execQual.c to get it
a good chunk closer to 100% test coverage, and then do the same for the
new implementation (there's probably very little additional tests
needed after the conversion). Given all tests pass before/after, and
there's a lot of them, I think we can have a reasonable confidence of a
low bug density.

Large parts of the changes are fairly mechanical, the interesting bits
probably are:

- there's a lot fewer "full blown" node types for expression evaluation
  anymore. In most cases there's now only a top-level ExprState node
  that's nodeTag()-able. The actual information to execute an
  instruction now is in ExprState->steps[i]
- because of the previous point, it now isn't legal anymore to call
  ExecInitExpr() on a list of expressions - ExecInitExprList() is a
  convenience wrapper around manually iterating over the list (gets rid
  of ugly casts, too)
- Because they behave differently on an actual expression evaluation
  basis, quals / check constraints now have to be initialized with
  ExecInitQual()/ExecInitCheck(). That way the shortcut behaviour and
  such can be retained, while also being faster.
- PlanState->targetlist is gone. Because expressions aren't initialized
  on their own anymore, but as part of ExecProject, there's no need for
  it.
- The separation between ExecInitExprRec() and
  ExecEvalExpr()/ExecInterpExpr() is different - we try to do as much
  as possible during initialization.
- I retained some ugly hackering around
  innermost_caseval|null/domainval|null - I had started down a path of
  removing the use of those in random parts of the system, but I ended
  up not going for it.

Greetings,

Andres Freund
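For readers skimming the thread, the structures being described have roughly this shape - a sketch assembled from names used across these messages (EEOP_* opcodes, resvalue/resnull, the per-opcode union 'd', intptr_t opcodes); the exact layout is an assumption, not the committed patch:

	/* Sketch only - modeled on names appearing in this thread. */
	#include "postgres.h"
	#include "nodes/nodes.h"

	typedef enum ExprEvalOp
	{
		EEOP_INNER_VAR,				/* fetch an attribute of the inner tuple */
		EEOP_CONST,					/* load a constant */
		EEOP_DONE					/* expression fully evaluated */
	} ExprEvalOp;

	typedef struct ExprEvalStep
	{
		intptr_t	opcode;			/* enum value, or jump-threaded label address */
		Datum	   *resvalue;		/* where to store this step's result */
		bool	   *resnull;
		union						/* per-opcode data, kept small */
		{
			struct { int attnum; } var;
			struct { Datum value; bool isnull; } constval;
		}			d;
	} ExprEvalStep;

	typedef struct ExprState
	{
		NodeTag		tag;			/* the one remaining nodeTag()-able part */
		ExprEvalStep *steps;		/* linear program, ends with EEOP_DONE */
		int			steps_len;
	} ExprState;

The point of the linear steps[] array, as opposed to a tree of nodes, is that evaluation becomes a tight loop over a flat program rather than a recursive walk with indirect calls at every node.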
On 3/10/17 14:24, Andres Freund wrote:
> What problem are you thinking of exactly, and what'd be the solution?
> I'd guess that people wouldn't like being unable to change columns in a
> table if some function returned the type.

Well, that's pretty much the problem. For example:

create type t1 as (a int, b int);

create function f1() returns setof t1 language plpgsql
as $$ begin return query values (1, 2); end $$;

alter type t1 drop attribute b;

select * from f1();  -- fail

The dependency system should arguably guard against that. Yes, it
would make some things more cumbersome, but, well, it already does
that.

>> Not necessarily in this patch, but I would like to keep the options
>> open.
>
> Yea, seems worth thinking about. I don't think the patch closes down
> options?

nope

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3/10/17 14:40, Andres Freund wrote: > I'd really like to get it in. The performance improvements on its own > are significant, and it provides the basis for significantly larger > improvements again (JIT) - those followup improvements are large patches > again though, so I'd rather not do all of that next cycle. > > My next step (over the weekend) is to add tests to execQual.c to get it > a good chunk closer to 100% test coverage, and then do the same for the > new implementation (there's probably very little additional tests needed > after the conversion). Given all tests pass before/after, and there's a > lot of them, I think we can have a reasonable confidence of a low bug > density. That seems like a plan, but now would probably be a good time for some other hackers to take note of this proposal. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Mar 10, 2017 at 07:15:59PM -0500, Peter Eisentraut wrote: > On 3/10/17 14:40, Andres Freund wrote: > > I'd really like to get it in. The performance improvements on its own > > are significant, and it provides the basis for significantly larger > > improvements again (JIT) - those followup improvements are large patches > > again though, so I'd rather not do all of that next cycle. > > > > My next step (over the weekend) is to add tests to execQual.c to get it > > a good chunk closer to 100% test coverage, and then do the same for the > > new implementation (there's probably very little additional tests needed > > after the conversion). Given all tests pass before/after, and there's a > > lot of them, I think we can have a reasonable confidence of a low bug > > density. > > That seems like a plan, but now would probably be a good time for some > other hackers to take note of this proposal. Well, the executor is long overdue for improvement, so I would love to have this in 10.0. I am not sure what additional polishing will happen if we punt it for 11.0. I think the only downside of having it in 10.0 is that it will not have lived in the source tree for as long a time between commit and PG release, but if it is tested, that seems fine. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Fri, Mar 10, 2017 at 7:42 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Mar 10, 2017 at 07:15:59PM -0500, Peter Eisentraut wrote:
>> On 3/10/17 14:40, Andres Freund wrote:
>>> I'd really like to get it in. The performance improvements on its own
>>> are significant, and it provides the basis for significantly larger
>>> improvements again (JIT) - those followup improvements are large patches
>>> again though, so I'd rather not do all of that next cycle.
>>>
>>> My next step (over the weekend) is to add tests to execQual.c to get it
>>> a good chunk closer to 100% test coverage, and then do the same for the
>>> new implementation (there's probably very little additional tests needed
>>> after the conversion). Given all tests pass before/after, and there's a
>>> lot of them, I think we can have a reasonable confidence of a low bug
>>> density.
>>
>> That seems like a plan, but now would probably be a good time for some
>> other hackers to take note of this proposal.
>
> Well, the executor is long overdue for improvement, so I would love to
> have this in 10.0. I am not sure what additional polishing will happen
> if we punt it for 11.0. I think the only downside of having it in 10.0
> is that it will not have lived in the source tree for as long a time
> between commit and PG release, but if it is tested, that seems fine.

Yeah. I think we can't rule out the possibility that this patch is
full of bugs, but (1) a lot of the change is pretty mechanical and (2)
none of this touches the on-disk format or the catalog contents, so it
doesn't break anybody's existing install if we have to patch it. On
the other hand, (3) it's a big patch and (4) if expression evaluation
doesn't work, PostgreSQL is pretty much unusable. So my feeling is:

1. If Andres commits this and it turns out to be seriously buggy, he
should be prepared to revert it without a big fight. We can try again
for v11.

2. If Andres commits this and it turns out to be mildly buggy, he
should be prepared to address bugs very proactively and very promptly.

3. If Andres wants to commit this for v10, it should be done RSN.
Feature freeze isn't technically until the end of March, but giant
potentially-destabilizing patches that land on March 31st are likely
to lead to outcomes that nobody wants.

Like, I think, everyone else, I'd really like to have these speedups
and I think they will benefit everyone who uses PostgreSQL. However, I
think that we cannot afford to have a repeat of the simplehash thing
where serious bugs sat around for months without really getting
properly addressed. If that happens with this, I will be extremely
unhappy, and I'm fairly sure I won't be alone.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 03/07/2017 04:01 AM, Andres Freund wrote:
> Hi,
>
> Attached is an updated version of the patchset, but more importantly
> some benchmark numbers.

I wanted to do a bit of testing and benchmarking on this, but 0004
seems to be a bit broken. The patch does not apply anymore - there are
some conflicts in execQual.c, but I think I fixed those. But then I ran
into a bunch of compile-time errors, because some of the executor nodes
still reference bits that were moved elsewhere. E.g. there's no
targetlist in PlanState anymore, yet nodeGatherMerge and
nodeTableFuncscan do this:

	gm_state->ps.targetlist = (List *)
		ExecInitExpr((Expr *) node->plan.targetlist,
					 (PlanState *) gm_state);

Some of the nodes also assign to ps.qual values that are (List *), but
that field was switched to ExprState. That needs fixing, I guess.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote:
> I wanted to do a bit of testing and benchmarking on this, but 0004
> seems to be a bit broken.

Well, "broken" in the sense that it's already outdated, because other
stuff that got merged.

> The patch does not apply anymore - there are some conflicts in
> execQual.c, but I think I fixed those. But then I ran into a bunch of
> compile-time errors, because some of the executor nodes still
> reference bits that were moved elsewhere.

Updated patch attached. Note that this patch has two changes I've not
yet evaluated performance-wise.

Questions:
- New code calls UpdateDomainConstraintRef at the beginning of
  expression evaluation, not for each row. That allows to generate
  "proper programs" for constraint checking, which is quite useful for
  performance. But it's a behaviour change.
- I have *not* changed that, but I'm less than convinced of the pattern
  of "runtime" get_cached_rowtype calls.

Changes:
- comments
- fix minor bug in BooleanTest brought to light thanks to the new
  regression tests (No, IS TRUE isn't the same as IS NOT FALSE)
- fix (theoretical?) issue with get_cached_rowtype() being called once
  for deforming and once for forming a tuple for FieldStores.
- fixed/implemented an optimization XXX around bool AND/OR because the
  solution actually makes the code easier.

What's basically missing here is:
- pgindent run and minimizing resulting damage
- A higher level explanation of how the interpreted expression
  evaluation works. This patch implements a "fairly standard" approach,
  but it's imo sufficiently involved that it deserves one.
- There's one remaining FIXME that needs to be fixed.
- I sometimes see a minor performance regression (~1.2%) in TPC-H's Q17
  that might or might not be code layout related.
- continue the pass to ensure all worthwhile comments from execQual.c
  are retained.

Could be delayed:
- add regression tests for track_functions - I've manually verified it
  works, and we previously didn't have tests either.

Greetings,

Andres Freund
On 13.03.2017 11:03, Andres Freund wrote:
> Hi,
>
> On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote:
>> I wanted to do a bit of testing and benchmarking on this, but 0004
>> seems to be a bit broken.
>
> Well, "broken" in the sense that it's already outdated, because other
> stuff that got merged.
>
>> The patch does not apply anymore - there are some conflicts in
>> execQual.c, but I think I fixed those. But then I ran into a bunch of
>> compile-time errors, because some of the executor nodes still
>> reference bits that were moved elsewhere.
>
> Updated patch attached. Note that this patch has two changes I've not
> yet evaluated performance-wise.
I got the following results on my system with an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz and 16GB RAM,
TPC-H Q1/Q6 scale 10, shared_buffers=8GB, pg_prewarm on the lineitem table projection:
             |    Q1    |    Q6
-------------+----------+---------
Master       | 7503 ms  | 1171 ms
Your patch   | 6420 ms  | 1034 ms
VOPS         |  396 ms  |  249 ms
VOPS + patch |  367 ms  |  233 ms
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 03/13/2017 09:03 AM, Andres Freund wrote: > Hi, > > On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote: >> I wanted to do a bit of testing and benchmarking on this, but 0004 seems to >> be a bit broken. > > Well, "broken" in the sense that it's already outdated, because other > stuff that got merged. > Yes, that's what I meant. Sorry for not being quite clear. > >> The patch does not apply anymore - there are some conflicts >> in execQual.c, but I think I fixed those. But then I ran into a bunch of >> compile-time errors, because some of the executor nodes still reference bits >> that were moved elsewhere. > > Updated patch attached. Note that this patch has two changes I've not > yet evaluated performance-wise. > And which are those? I plan to do a bit of testing / benchmarking on this later this week, depending on other testing already happening on my machine. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2017-03-13 01:03:51 -0700, Andres Freund wrote:
> What's basically missing here is:
> - pgindent run and minimizing resulting damage

Running into a bit of an issue here - pgindent mangles something like

	EEO_SWITCH (op->opcode)
	{
		EEO_CASE(EEO_DONE):
			goto out;

		EEO_CASE(EEO_INNER_FETCHSOME):
			/* XXX: worthwhile to check tts_nvalid inline first? */
			slot_getsomeattrs(innerslot, op->d.fetch.last_var);
			EEO_DISPATCH(op);

		EEO_CASE(EEO_OUTER_FETCHSOME):
			slot_getsomeattrs(outerslot, op->d.fetch.last_var);
			EEO_DISPATCH(op);

		EEO_CASE(EEO_SCAN_FETCHSOME):
			slot_getsomeattrs(scanslot, op->d.fetch.last_var);
			EEO_DISPATCH(op);

		EEO_CASE(EEO_INNER_VAR):
			{
				int			attnum = op->d.var.attnum;

				/*
				 * Can't assert tts_nvalid, as wholerow var evaluation or such
				 * could have materialized the slot - but the contents are
				 * still valid :/
				 */
				Assert(op->d.var.attnum >= 0);
				*op->resnull = innerslot->tts_isnull[attnum];
				*op->resvalue = innerslot->tts_values[attnum];
				EEO_DISPATCH(op);
			}

into

	EEO_SWITCH(op->opcode)
	{
EEO_CASE(EEO_DONE):
		goto out;

EEO_CASE(EEO_INNER_FETCHSOME):
		/* XXX: worthwhile to check tts_nvalid inline first? */
		slot_getsomeattrs(innerslot, op->d.fetch.last_var);
		EEO_DISPATCH(op);

EEO_CASE(EEO_OUTER_FETCHSOME):
		slot_getsomeattrs(outerslot, op->d.fetch.last_var);
		EEO_DISPATCH(op);

EEO_CASE(EEO_SCAN_FETCHSOME):
		slot_getsomeattrs(scanslot, op->d.fetch.last_var);
		EEO_DISPATCH(op);

EEO_CASE(EEO_INNER_VAR):
		{
			int			attnum = op->d.var.attnum;

			/*
			 * Can't assert tts_nvalid, as wholerow var evaluation or such
			 * could have materialized the slot - but the contents are still
			 * valid :/
			 */
			Assert(op->d.var.attnum >= 0);
			*op->resnull = innerslot->tts_isnull[attnum];
			*op->resvalue = innerslot->tts_values[attnum];
			EEO_DISPATCH(op);
		}

which is a bit annoying. (the EEO_CASE is either a jump label or a case
statement, depending on computed goto availability).

It seems we could either:
1) live with the damage
2) disable pgindent
3) move the : inside EEO_CASE's definition, and only use {} blocks.

I'm inclined to go for 3). Opinions?

- Andres
Hi,

On 2017-03-13 22:37:51 +0100, Tomas Vondra wrote:
> On 03/13/2017 09:03 AM, Andres Freund wrote:
>> On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote:
>>> I wanted to do a bit of testing and benchmarking on this, but 0004
>>> seems to be a bit broken.
>>
>> Well, "broken" in the sense that it's already outdated, because other
>> stuff that got merged.
>
> Yes, that's what I meant. Sorry for not being quite clear.

And the patch already again has conflicts. Updated version, with a good
chunk of other changes, attached.

>>> The patch does not apply anymore - there are some conflicts in
>>> execQual.c, but I think I fixed those. But then I ran into a bunch of
>>> compile-time errors, because some of the executor nodes still
>>> reference bits that were moved elsewhere.
>>
>> Updated patch attached. Note that this patch has two changes I've not
>> yet evaluated performance-wise.
>
> And which are those?

Primarily around minor performance regressions for the evaluation of
very simple (i.e. single Var/Const) expressions. I did a fair amount of
work hunting these down.

Besides that, this version has:
- pgindented most of the affected pieces (i.e. all significant new code
  has been reindented, not all touched code)
- Transported missing comments from execQual.c - afaics all relevant
  ones have been moved over
- Replaced ExprState->cb with evalfunc - that's what it was named
  previously, and I see no reason to diverge.
- significantly expanded header comments, especially in
  execInterpExpr.c, explaining the basic approach of expression
  evaluation. Feedback here would be greatly appreciated - I've spent
  so much time neck deep in this, that I don't quite know in how much
  detail to go.
- Initial polished commit messages and such.

My current plan is to not do very much on this tomorrow, do another
full pass on Wednesday, and push it, unless there's protest till then.

> I plan to do a bit of testing / benchmarking on this later this week,
> depending on other testing already happening on my machine.

Cool.

I've performed a bunch more benchmarks, and the results have been
pretty favorable. Due to some more work hunting down minor regressions
I can't make any TPC-H query perform worse than on master, with either
best/average of three executions. I also did a bunch of runs with
parallelism, and the results there are also pretty good - even better
than non-parallel sometimes.

These results include parallelism, but were done *before* rebasing on
top of Noah's changes. There was also, during all runs, a browser and
emacs running, so it's certainly possible that there's some skew.

-par is with parallelism enabled to 8, the other is without. Otherwise
same setup as two mails up.
master-par q01 min: 4629.643   master min: 15108.813 [diff +69.36]  dev min: 12629.0 [diff +63.34]   dev-par min: 3458.899 [diff -33.85]
master-par q02 min: 1048.511   master min: 1196.596 [diff +12.38]   dev min: 1182.652 [diff +11.34]  dev-par min: 1044.876 [diff -0.35]
master-par q03 min: 4651.492   master min: 6949.525 [diff +33.07]   dev min: 6441.467 [diff +27.79]  dev-par min: 3928.919 [diff -18.39]
master-par q04 min: 698.037    master min: 981.081 [diff +28.85]    dev min: 947.506 [diff +26.33]   dev-par min: 341.855 [diff -104.19]
master-par q05 min: 871.528    master min: 5119.728 [diff +82.98]   dev min: 4979.498 [diff +82.50]  dev-par min: 702.3 [diff -24.10]
master-par q06 min: 1107.307   master min: 1807.433 [diff +38.74]   dev min: 1694.513 [diff +34.65]  dev-par min: 738.037 [diff -50.03]
master-par q07 min: 4867.412   master min: 4745.141 [diff -2.58]    dev min: 4682.068 [diff -3.96]   dev-par min: 4207.644 [diff -15.68]
master-par q08 min: 432.159    master min: 1465.219 [diff +70.51]   dev min: 1416.691 [diff +69.50]  dev-par min: 418.331 [diff -3.31]
master-par q09 min: 3965.76    master min: 6441.099 [diff +38.43]   dev min: 6220.095 [diff +36.24]  dev-par min: 3875.673 [diff -2.32]
master-par q10 min: 1129.351   master min: 5818.318 [diff +80.59]   dev min: 5406.125 [diff +79.11]  dev-par min: 1107.759 [diff -1.95]
master-par q11 min: 247.598    master min: 291.061 [diff +14.93]    dev min: 277.18 [diff +10.67]    dev-par min: 241.983 [diff -2.32]
master-par q12 min: 1532.855   master min: 4552.034 [diff +66.33]   dev min: 4431.584 [diff +65.41]  dev-par min: 1395.703 [diff -9.83]
master-par q13 min: 8638.503   master min: 8009.64 [diff -7.85]     dev min: 7891.661 [diff -9.46]   dev-par min: 8314.437 [diff -3.90]
master-par q14 min: 723.878    master min: 764.35 [diff +5.29]      dev min: 730.526 [diff +0.91]    dev-par min: 717.66 [diff -0.87]
master-par q15 min: 1991.915   master min: 1809.162 [diff -10.10]   dev min: 1716.983 [diff -16.01]  dev-par min: 1879.927 [diff -5.96]
master-par q16 min: 1596.643   master min: 1843.553 [diff +13.39]   dev min: 1728.467 [diff +7.63]   dev-par min: 1429.103 [diff -11.72]
master-par q17 min: 354.357    master min: 418.761 [diff +15.38]    dev min: 414.958 [diff +14.60]   dev-par min: 322.471 [diff -9.89]
master-par q18 min: 11707.479  master min: 14216.567 [diff +17.65]  dev min: 14206.893 [diff +17.59] dev-par min: 11642.863 [diff -0.55]
master-par q19 min: 116.819    master min: 325.531 [diff +64.11]    dev min: 287.497 [diff +59.37]   dev-par min: 102.909 [diff -13.52]
master-par q20 min: 430.85     master min: 500.92 [diff +13.99]     dev min: 478.536 [diff +9.96]    dev-par min: 410.548 [diff -4.95]
master-par q22 min: 460.712    master min: 626.207 [diff +26.43]    dev min: 595.708 [diff +22.66]   dev-par min: 434.851 [diff -5.95]

These seem like pretty neat results. Sorry for not cleaning up the
representation - too tired.

Greetings,

Andres Freund
Andres Freund wrote:

> 	EEO_SWITCH(op->opcode)
> 	{
> EEO_CASE(EEO_DONE):
> 		goto out;

Oh my.

> which is a bit annoying. (the EEO_CASE is either a jump label or a case
> statement, depending on computed goto availability).
>
> It seems we could either:
> 1) live with the damage
> 2) disable pgindent
> 3) move the : inside EEO_CASE's definition, and only use {} blocks.

I think 3) is nasty because the result doesn't look like normal C. If
EEO_CASE() entries are potentially jump labels, then the indentation
becomes correct. Why not accept it? It looks odd, but that gives a
better hint to the reader about what's going on.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 3d6a3801c0..d205101b89 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -47,7 +47,14 @@
 #include "utils/rel.h"

-static bool get_last_attnums(Node *node, ProjectionInfo *projInfo);
+typedef struct LastLastAttnumInfo

LastLast?

Patch 0003 is huge. It would be good to have someone at least read it
before pushing, but I don't think anyone other than you has done so.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 03/14/2017 08:53 AM, Andres Freund wrote:
> Besides that, this version has:
> - pgindented most of the affected pieces (i.e. all significant new code
>   has been reindented, not all touched code)

I think you'll need to add all the inner structs of ExprEvalStep to
typedefs.list to indent them right.

> My current plan is to not do very much on this tomorrow, do another
> full pass on Wednesday, and push it, unless there's protest till then.

I looked at patch 0004. Some comments:

* EEO_DISPATCH_DIRECT(op) takes 'op' as an argument, but it really
assumes that 'op' has already been set to point to the jump target. I
find that a bit weird. I guess the idea is that you always pass the
Program Counter variable as 'op' argument. For consistency, would be
good if EEO_SWITCH() also took just 'op' as the argument, rather than
op->opcode. But I think it would be more clear if they both just
assumed that there's a variable called 'op' that points to the current
instruction.

* All the callers of EEO_DISPATCH_DIRECT(op) set 'op' just prior to
calling EEO_DISPATCH_DIRECT(op). How about having a macro
EEO_JUMP(<step number>), to encapsulate setting 'op' and jumping to it
in one operation?

* ExecEvalStepOp() seems relatively expensive, with the linear scan
over all the opcodes, if called on an ExprState that already has
EEO_FLAG_JUMP_THREADED set. All the callers use it to check if the
opcode is a particular one, so you could check if the opcode matches
that particular one, instead of scanning the dispatch table to find
what it is.

* But does ExecEvalStepOp() ever actually get called on an expression
with EEO_FLAG_JUMP_THREADED already set? It's only used in
ExecInstantiateInterpretedExpr(), and it's called only once on each
ExprState. How about just adding an Assert that EEO_FLAG_JUMP_THREADED
is not yet set? Or drop the EEO_FLAG_JUMP_THREADED flag altogether,
and assert that evalfunc != ExecInterpExpr.

* How tight are we on space in the ExprEvalStep union? Currently, the
jump-threading preparation replaces the opcodes with the goto labels,
but it would be really nice to keep the original opcodes, for
debugging purposes if nothing else.

* execInterpExpr.c is quite a mouthful. How about exprInterpreter.c?

Attached is a patch, on top of your
0004-Faster-expression-evaluation-and-targetlist-projecti.patch, with
some minor tweaks and comments here and there (search for HEIKKI).
It's mostly the same stuff I listed above, implemented in a quick &
dirty way. I hope it's helpful.

- Heikki
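The suggested EEO_JUMP() could be as simple as the following sketch, assuming the interpreter keeps the current instruction in a local 'op' and the steps array in 'state' (both names are guesses from the surrounding discussion):

	/* Sketch: point 'op' at step 'stepno' and dispatch to it in one go. */
	#define EEO_JUMP(stepno) \
		do { \
			op = &state->steps[stepno]; \
			EEO_DISPATCH_DIRECT(op); \
		} while (0)

That keeps the "set op, then dispatch" pairing in a single place, so individual opcode implementations cannot get it half right.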
Hi,

On 2017-03-14 16:58:54 +0200, Heikki Linnakangas wrote:
> On 03/14/2017 08:53 AM, Andres Freund wrote:
>> Besides that, this version has:
>> - pgindented most of the affected pieces (i.e. all significant new code
>>   has been reindented, not all touched code)
>
> I think you'll need to add all the inner structs of ExprEvalStep to
> typedefs.list to indent them right.

Yea, I saw that. Since they're not named atm, that'd probably not work
well. I'd be inclined to just live with it :/

> I looked at patch 0004.

Thanks!

> * EEO_DISPATCH_DIRECT(op) takes 'op' as an argument, but it really
> assumes that 'op' has already been set to point to the jump target. I
> find that a bit weird. I guess the idea is that you always pass the
> Program Counter variable as 'op' argument. For consistency, would be
> good if EEO_SWITCH() also took just 'op' as the argument, rather than
> op->opcode. But I think it would be more clear if they both just
> assumed that there's a variable called 'op' that points to the current
> instruction.

Hm. I dislike magic variable names. I think I prefer your idea of
passing the op entirely to EEO_SWITCH.

> * All the callers of EEO_DISPATCH_DIRECT(op) set 'op' just prior to
> calling EEO_DISPATCH_DIRECT(op). How about having a macro
> EEO_JUMP(<step number>), to encapsulate setting 'op' and jumping to it
> in one operation?

Ok.

> * ExecEvalStepOp() seems relatively expensive, with the linear scan
> over all the opcodes, if called on an ExprState that already has
> EEO_FLAG_JUMP_THREADED set. All the callers use it to check if the
> opcode is a particular one, so you could check if the opcode matches
> that particular one, instead of scanning the dispatch table to find
> what it is.

It should be fairly cheap at that place, unless the same expression is
instantiated twice for some reason. I have been wondering about adding
a second table ptr->op that can be binary searched for the operation
to make it cheaper.

> * But does ExecEvalStepOp() ever actually get called on an expression
> with EEO_FLAG_JUMP_THREADED already set? It's only used in
> ExecInstantiateInterpretedExpr(), and it's called only once on each
> ExprState. How about just adding an Assert that EEO_FLAG_JUMP_THREADED
> is not yet set? Or drop the EEO_FLAG_JUMP_THREADED flag altogether,
> and assert that evalfunc != ExecInterpExpr.

The primary use of ExecEvalStepOp() is really debugging, so I'd like
to avoid adding that restriction. I guess we can add a second function
for this if needed.

> * How tight are we on space in the ExprEvalStep union? Currently, the
> jump-threading preparation replaces the opcodes with the goto labels,
> but it would be really nice to keep the original opcodes, for
> debugging purposes if nothing else.

There's nothing left to spare :(. For debugging I've just used
ExecEvalStepOp() - it's inconvenient to directly print the struct
anyway, since gdb prints the whole union...

> * execInterpExpr.c is quite a mouthful. How about exprInterpreter.c?
>
> Attached is a patch, on top of your
> 0004-Faster-expression-evaluation-and-targetlist-projecti.patch, with
> some minor tweaks and comments here and there (search for HEIKKI).
> It's mostly the same stuff I listed above, implemented in a quick &
> dirty way. I hope it's helpful.

Thanks!

> typedef enum ExprEvalOp
> {
> +	/*
> +	 * HEIKKI: How about renaming these to EEOP_* ? I think it'd be a
> +	 * good mnemonic to remember that these are opcodes, and just one
> +	 * extra letter.
> +	 */

I don't mind the change.
> +	/* HEIKKI: Is it safe to assume that sizeof(size_t) >= sizeof(void *) ? */

Yes, seems very likely - it's supposed to hold any potential sizeof()
return value. You could end up on platforms where that's not true, if
it used tagged pointers or such. I guess we could instead use a union,
but that doesn't seem that much of a benefit.

Greetings,

Andres Freund
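For context, the ExecEvalStepOp() reverse lookup being discussed is essentially the following shape - a sketch that assumes a dispatch_table[] of label addresses and an EEOP_LAST sentinel, both of which are guesses at naming rather than the patch's actual identifiers:

	/* Sketch: recover the ExprEvalOp enum value from a jump-threaded step. */
	static ExprEvalOp
	ExecEvalStepOp(ExprState *state, ExprEvalStep *op)
	{
		if (state->flags & EEO_FLAG_JUMP_THREADED)
		{
			int			i;

			/* opcode now holds a label address; scan the table to map it back */
			for (i = 0; i < EEOP_LAST; i++)
			{
				if ((void *) op->opcode == dispatch_table[i])
					return (ExprEvalOp) i;
			}
			elog(ERROR, "unknown expression opcode");
		}
		return (ExprEvalOp) op->opcode;
	}

The linear scan is why it is cheap enough for one-time instantiation and debugging, but would be worth replacing with a sorted lookup table if it were ever on a hot path.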
On 2017-03-14 08:28:02 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > > EEO_SWITCH(op->opcode) > > { > > EEO_CASE(EEO_DONE): > > goto out; > > Oh my. > > > which is a bit annoying. (the EEO_CASE is either a jump label or a case > > statement, depending on computed goto availability). > > > > It seems we could either: > > 1) live with the damage > > 2) disable pgindent > > 3) move the : inside EEO_CASE's definition, and only use {} blocks. > > I think 3) is nasty because the result doesn't look like normal C. We have a bunch of such constructs already however. foreach(), PG_TRY(). That means 3) has the advantage that our editors and pgindent can already deal with it. > EEO_CASE() are potentially jump labels, then indentation becomes > correct. Why not accept it? It looks odd, but that gives a better hint > to the reader about what's going on. Seems likely that every editor would indent them differently :( I'm leaning towards 3) for the reasons above and after trying out the different indentations, but I don't have a strong opinion. Greetings, Andres Freund
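For readers following along, the two expansions under discussion look roughly like this - a sketch of plausible macro bodies, not the patch's exact definitions (the callers are expected to point 'op' at the next instruction before dispatching, which is what the EEO_JUMP() idea upthread encapsulates):

	#if defined(HAVE_COMPUTED_GOTO)
	/* EEO_CASE(x): expands to a plain goto label; dispatch jumps straight
	 * to the label address stored in the step's opcode field */
	#define EEO_SWITCH(d)				/* no switch needed */
	#define EEO_CASE(name)		CASE_##name
	#define EEO_DISPATCH(op)	goto *((void *) (op)->opcode)
	#else
	/* fallback: a real switch, re-entered via goto for each instruction */
	#define EEO_SWITCH(d)		starteval: switch (d)
	#define EEO_CASE(name)		case name
	#define EEO_DISPATCH(op)	goto starteval
	#endif

In the computed-goto build, `EEO_CASE(EEO_DONE):` really is a bare label, which is why pgindent (and editors) treat it the way Álvaro describes.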
On 2017-03-14 08:44:24 -0300, Alvaro Herrera wrote: > Patch 0003 is huge. I suspect you mean 0004? If so - yes :(. I unfortunately don't think there's a useful way to split it in smaller chunks - I originally moved ops over one-by-one, but that required a lot of duplicated structs and such... > It would be good to have someone at least read it before pushing, but > I don't think anyone other than you has done so. I'd love for somebody else to look through it, I tried asking multiple times... Seems like threatening a commit is the best way to get it :P - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-03-14 08:44:24 -0300, Alvaro Herrera wrote: >> It would be good to have someone at least read it before pushing, but >> I don't think anyone other than you has done so. > I'd love for somebody else > to look through it, I tried asking multiple times... Seems like > threatening a commit is the best way to get it :P I'm willing to look, but the last few messages make it sound like you're not all that close to a finished version. regards, tom lane
On 03/14/2017 07:31 PM, Andres Freund wrote:
> On 2017-03-14 16:58:54 +0200, Heikki Linnakangas wrote:
>> * How tight are we on space in the ExprEvalStep union? Currently, the
>> jump-threading preparation replaces the opcodes with the goto labels,
>> but it would be really nice to keep the original opcodes, for
>> debugging purposes if nothing else.
>
> There's nothing left to spare :(. For debugging I've just used
> ExecEvalStepOp() - it's inconvenient to directly print the struct
> anyway, since gdb prints the whole union...

This comment above the union is not accurate:

> /*
>  * Data for an operation. Inline stored data is faster to access, but also
>  * bloats the size of all instructions. The union should be kept below 48
>  * bytes (so the entire struct is below 64 bytes, a single cacheline on
>  * common systems).
>  */

On my x86-64 system, the whole struct is 64 bytes, but the union is
only 40 bytes. The fields before the union occupy 24 bytes. IOW, the
union should be kept below 40 bytes, not 48.

Hmm. Could we make the instructions variable size? It would allow
packing the small instructions even more tight, and we wouldn't need
to obsess over a particular maximum size for more complicated
instructions.

A compromise might be to have the fixed size, but have "continuation"
opcodes for the more complicated instructions. An XmlExpr instruction,
for example, could have an extra instruction after the primary one,
just to hold more fields. When evaluating it, we would just increment
the Program Counter by two instead of one, to skip over the extra
instruction.

For reference, here are the sizes of all the structs in the union:

40: xmlexpr
40: scalararrayop
40: minmax
40: iocoerce
40: convert_rowtype
32: wholerow
32: rowcompare_step
32: row
32: func
32: fieldstore
32: domaincheck
32: boolexpr
32: arrayexpr
32: arraycoerce
24: casewhen
24: casethen
24: arrayref_checksubscript
16: grouping_func
16: fieldselect
16: constval
16: casetest
16: assign_var
16: arrayref
8: window_func
8: subplan
8: sqlvaluefunction
8: param
8: nulltest_row
8: assign_tmp
8: alternative_subplan
8: aggref
4: var
4: rowcompare_final
4: qualexpr
4: fetch
4: coalesce

- Heikki
Hi,

On 2017-03-14 20:28:51 +0200, Heikki Linnakangas wrote:
> On 03/14/2017 07:31 PM, Andres Freund wrote:
>> On 2017-03-14 16:58:54 +0200, Heikki Linnakangas wrote:
>>> * How tight are we on space in the ExprEvalStep union? Currently, the
>>> jump-threading preparation replaces the opcodes with the goto labels,
>>> but it would be really nice to keep the original opcodes, for
>>> debugging purposes if nothing else.
>>
>> There's nothing left to spare :(. For debugging I've just used
>> ExecEvalStepOp() - it's inconvenient to directly print the struct
>> anyway, since gdb prints the whole union...
>
> This comment above the union is not accurate:
>
>> /*
>>  * Data for an operation. Inline stored data is faster to access, but also
>>  * bloats the size of all instructions. The union should be kept below 48
>>  * bytes (so the entire struct is below 64 bytes, a single cacheline on
>>  * common systems).
>>  */
>
> On my x86-64 system, the whole struct is 64 bytes, but the union is
> only 40 bytes. The fields before the union occupy 24 bytes. IOW, the
> union should be kept below 40 bytes, not 48.

Point. Will update.

> Hmm. Could we make the instructions variable size? It would allow
> packing the small instructions even more tight, and we wouldn't need
> to obsess over a particular maximum size for more complicated
> instructions.

That makes jumps a lot more complicated. I'd experimented with it and
given it up as "not worth it". If we were to try to do so, we'd also
not like storing the pointer and enum variants both, since it'd again
reduce the density.

> A compromise might be to have the fixed size, but have "continuation"
> opcodes for the more complicated instructions. An XmlExpr instruction,
> for example, could have an extra instruction after the primary one,
> just to hold more fields. When evaluating it, we would just increment
> the Program Counter by two instead of one, to skip over the extra
> instruction.

I think for cases where 40 bytes becomes the limit, it's usually better
to have out-of-line state. We could do this, but we'd also waste a
bunch of space in the array (namely the 24 byte Op "header").

Greetings,

Andres Freund
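The size budget above is the kind of invariant that can be pinned down with a compile-time check; here is a sketch using StaticAssertStmt() from c.h - not something the patch is known to contain, just one way to keep the comment honest (it assumes the ExprEvalStep definition is in scope):

	/*
	 * Guard the layout assumption discussed above: 24 bytes of fixed
	 * fields plus a union of at most 40 bytes must fit one 64-byte
	 * cache line.
	 */
	static inline void
	ExprEvalStepSizeCheck(void)
	{
		StaticAssertStmt(sizeof(ExprEvalStep) <= 64,
						 "ExprEvalStep must not exceed a cache line");
	}

With such a check, anyone who grows one of the union members past the budget gets a build failure instead of a silent slowdown.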
Andres, sorry I haven't had a chance to look at this great stuff you've been doing. I've wanted to get to it, but work keeps getting in the way. ;-)
I do have one observation based on my experiments with your first version of the code. In my tests, I found that expression init becomes a lot more expensive in this new model. (That's neither a surprise, nor a concern.) In particular, the function ExprEvalPushStep() is quite hot. In my code I made the following changes:
* Declare ExprEvalPushStep() "inline".
* Remove the "if (es->steps_alloc == 0)" condition from ExprEvalPushStep().
* In ExecInitExpr(), add:
state->steps_alloc = 16;
state->steps = palloc(sizeof(ExprEvalStep) * state->steps_alloc);
I found that this cut the cost of initializing the expression by about 20%. (Of course, that was on version 1 of your code, so the benefit may well be different now.)
On Tue, Mar 14, 2017 at 11:51 AM Andres Freund <andres@anarazel.de> wrote:
>> Hmm. Could we make the instructions variable size? It would allow packing
>> the small instructions even more tight, and we wouldn't need to obsess over
>> a particular maximum size for more complicated instructions.
>
> That makes jumps a lot more complicated. I'd experimented with it and
> given it up as "not worth it".
Back when I was at IBM, I spent a lot of time doing stuff like this. If you want to commit with the fixed size arrays, I'm happy to volunteer to look at packing it tighter as a follow-on piece of work. (It was already on my list of things to try anyhow.)
> If we were to try to do so, we'd also not like storing the pointer and
> enum variants both, since it'd again reduce the density.
From my experience, it's worth the small loss in density to carry around both the pointer and the enum - it makes debugging so much easier.
- Doug
Salesforce
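Put together, the suggestions above amount to a push-step path roughly like this sketch (steps and steps_alloc are field names from the messages above; steps_len and the doubling policy are assumptions added for illustration):

	/* Sketch of ExprEvalPushStep() with the suggested tweaks applied:
	 * declared inline, and with the steps_alloc == 0 branch removed
	 * because ExecInitExpr() now pre-allocates 16 entries. */
	static inline void
	ExprEvalPushStep(ExprState *es, const ExprEvalStep *s)
	{
		if (es->steps_len >= es->steps_alloc)
		{
			/* initial capacity was set up front, so just grow */
			es->steps_alloc *= 2;
			es->steps = repalloc(es->steps,
								 sizeof(ExprEvalStep) * es->steps_alloc);
		}
		memcpy(&es->steps[es->steps_len++], s, sizeof(ExprEvalStep));
	}

The win comes from the common case becoming a bounds check plus a memcpy, with no function-call overhead and no never-taken "first allocation" branch in the loop.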
Hi,

On 2017-03-14 22:03:45 +0000, Douglas Doole wrote:
> I do have one observation based on my experiments with your first
> version of the code. In my tests, I found that expression init becomes
> a lot more expensive in this new model. (That's neither a surprise,
> nor a concern.)

I suspect that's to a good degree because back then it'd often end up
trying the "new" stuff, but then fall back to the old machinery due to
unsupported operations. Which isn't a concern anymore, because now
there's full coverage. Otherwise the costs aren't, according to my
measurements, higher than before, excepting that some stuff that
happened at execution time is now happening during initialization.

> In particular, the function ExprEvalPushStep() is quite hot. In my
> code I made the following changes:
>
> * Declare ExprEvalPushStep() "inline".
> * Remove the "if (es->steps_alloc == 0)" condition from
>   ExprEvalPushStep().
> * In ExecInitExpr(), add:
>     state->steps_alloc = 16;
>     state->steps = palloc(sizeof(ExprEvalStep) * state->steps_alloc);
>
> I found that this cut the cost of initializing the expression by about
> 20%. (Of course, that was on version 1 of your code, so the benefit
> may well be different now.)

Hm. Right now ExprState's are allocated in several places - but we
could easily move that to a central place. Have a bit of a hard time
seeing that that branch during *initialization* time is that expensive,
especially given that previously we allocated a lot of memory
separately too.

> On Tue, Mar 14, 2017 at 11:51 AM Andres Freund <andres@anarazel.de> wrote:
>>> Hmm. Could we make the instructions variable size? It would allow
>>> packing the small instructions even more tight, and we wouldn't need
>>> to obsess over a particular maximum size for more complicated
>>> instructions.
>>
>> That makes jumps a lot more complicated. I'd experimented with it and
>> given it up as "not worth it".
>
> Back when I was at IBM, I spent a lot of time doing stuff like this. If
> you want to commit with the fixed size arrays, I'm happy to volunteer
> to look at packing it tighter as a follow-on piece of work. (It was
> already on my list of things to try anyhow.)

Yea, I think experimenting with that is a good idea. I just think this
is complicated enough, and I don't want to add a whole lot more
complexity for very debatable benefit.

>> If we were to try to do so, we'd also not like storing the pointer and
>> enum variants both, since it'd again reduce the density.
>
> From my experience, it's worth the small loss in density to carry
> around both the pointer and the enum - it makes debugging so much
> easier.

I found it annoying enough to print the instruction itself, due to the
union - so printing the opcode using a function isn't too bad...

Greetings,

Andres Freund
On Tue, Mar 14, 2017 at 3:16 PM Andres Freund <andres@anarazel.de> wrote:
> Hm. Right now ExprState's are allocated in several places - but we
> could easily move that to a central place. Have a bit of a hard time
> seeing that that branch during *initialization* time is that expensive,
> especially given that previously we allocated a lot of memory
> separately too.
I didn't make any comparisons of the cost of the new init against the old init with this change in particular - I just saw that it made the new init faster. I also didn't play around to determine whether the savings came from removing the branch misprediction, from the inlining, or both.
I certainly wouldn't hold up your commit for this, but it's something that might be worth a second look once the dust has settled.
Hi,

On 2017-03-14 16:58:54 +0200, Heikki Linnakangas wrote:
> * execInterpExpr.c is quite a mouthful. How about exprInterpreter.c?

I applied most of your changes in the attached version. Notes:

* I quite like how the EEO_JUMP/EEO_NEXT stuff looks now / after your
  change, that's clearly better.
* I reverted the removal of EEO_SWITCH() - pgindent goes a bit bonkers
  if it can see the switch(), but not the case: statements.
* I re-added ExecEvalStepOp() - for one it's hugely useful during
  debugging, for another I'm actually using it in a followup patch (so
  JIT compilation can run after the interpreter has been initialized).
  Added code guaranteeing that we can't run
  ExecInstantiateInterpretedExpr() twice.
* Renamed EEO_* opcode enum members to EEOP_* as suggested
* I've not yet renamed execInterpExpr.c - I don't like
  execInterpreter.c because it doesn't reference expressions and
  execExprInterpreter.c seems a bit long - but I can live with both
  (preferring the latter). Not that surprisingly I can also live with
  execInterpExpr.c ;)
* Addressed your size_t sizing concern by using, as is proper anyway,
  intptr_t.
* I tried to clarify the comment in execExpr.c's header that you marked
  as hard to understand.

Greetings,

Andres Freund
On 2017-03-14 14:19:18 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-03-14 08:44:24 -0300, Alvaro Herrera wrote: > >> It would be good to have someone at least read it before pushing, but > >> I don't think anyone other than you has done so. > > > I'd love for somebody else > > to look through it, I tried asking multiple times... Seems like > > threatening a commit is the best way to get it :P > > I'm willing to look Cool. > but the last few messages make it sound like you're not all that close > to a finished version. Hm, doesn't seem that far off to me - the latest few rounds of changes have been largely polishing (naming, fighting w/ pgindent, language polishing in comments, etc), otherwise I've basically just added a few lines of optimization after doing more profiling. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > [ new patch versions ] About to leave, but I had time to read 0003: It seems bizarre that you chose to spell the new configure symbol as HAVE__COMPUTED_GOTO rather than HAVE_COMPUTED_GOTO, especially so when the comment for PGAC_C_COMPUTED_GOTO alleges it's the latter. Also, being a neatnik, I would insert both the definition and the call of that macro between PGAC_C_BUILTIN_UNREACHABLE and PGAC_C_VA_ARGS, keeping it more or less in alphabetical order. regards, tom lane
Hi, On 2017-03-14 19:34:12 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > [ new patch versions ] > > About to leave, but I had time to read 0003: > > It seems bizarre that you chose to spell the new configure symbol as > HAVE__COMPUTED_GOTO rather than HAVE_COMPUTED_GOTO I went back-and-forth about this a number of times. We have a bunch of symbols defined with HAVE__ as a prefix (and some with HAVE_GCC__) - and more of the nearby code seems to use __ rather than _. I don't really know why we started doing that, but it's far from new.. Any idea why we introduce __ stuff? > Also, being a neatnik, I would insert both the definition and the > call of that macro between PGAC_C_BUILTIN_UNREACHABLE and > PGAC_C_VA_ARGS, keeping it more or less in alphabetical order. That works for me. Thanks, Andres
On 3/14/17 19:40, Andres Freund wrote: > Any idea why we introduce __ stuff? Because the symbols start with an underscore: /* Define to 1 if your compiler understands _Static_assert. */ #undef HAVE__STATIC_ASSERT There is apparently some inconsistency when symbols start with more than one underscore. But we don't need to introduce underscores that are not there. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-03-14 23:10:25 -0400, Peter Eisentraut wrote: > On 3/14/17 19:40, Andres Freund wrote: > > Any idea why we introduce __ stuff? > > Because the symbols start with an underscore: > > /* Define to 1 if your compiler understands _Static_assert. */ > #undef HAVE__STATIC_ASSERT Oh, I guess that makes some sense. But we do so very inconsistently, given we only keep the leading underscores not the trailing ones (see HAVE__VA_ARGS, HAVE_FUNCNAME__FUNC). > But we don't need to introduce underscores that are not there. Indeed. Don't want to repost for just this, but consider it adapted (and moved as requested by Tom). - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-03-14 19:34:12 -0400, Tom Lane wrote: >> It seems bizarre that you chose to spell the new configure symbol as >> HAVE__COMPUTED_GOTO rather than HAVE_COMPUTED_GOTO > I went back-and-forth about this a number of times. We have a bunch of > symbols defined with HAVE__ as a prefix (and some with HAVE_GCC__) - and > more of the nearby code seems to use __ rather than _. I don't really > know why we started doing that, but it's far from new.. > Any idea why we introduce __ stuff? The nearby stuff is describing features that have a specific name that includes a leading underscore or two, like __builtin_unreachable(). That doesn't seem to apply here, so I wouldn't add extra underscores. regards, tom lane
Andres Freund <andres@anarazel.de> writes: > [ latest patches ] I looked through 0002-Make-get_last_attnums-more-generic.patch. Although it seems relatively unobjectionable on its own, I'm not convinced that it's really useful to try to split it out like this. I see that 0004 removes the only call of ExecGetLastAttnums (the one in ExecBuildProjectionInfo) and then adds a single call in ExecInitExprSlots which is in execExpr.c. To me, the only reason ExecGetLastAttnums nee get_last_attnums is in execUtils.c in the first place is that it is a private subroutine of ExecBuildProjectionInfo. After these changes, it might as well be a private subroutine of ExecInitExprSlots. I'm suspicious of turning it into a globally accessible function as you've done here, because I doubt that it is of global use --- in particular, the fact that it doesn't deal specially with INDEX_VAR Vars seems rather specific to this one use-case. So for my money, you should drop 0002 altogether and just have 0004 remove get_last_attnums() from execUtils.c and stick it into execExpr.c. I suppose you still need the LastAttnumInfo API change so as to decouple it from ProjectionInfo, but that's minor. regards, tom lane
On March 15, 2017 12:33:28 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> [ latest patches ]
>
> I looked through 0002-Make-get_last_attnums-more-generic.patch.
>
> So for my money, you should drop 0002 altogether and just have 0004
> remove get_last_attnums() from execUtils.c and stick it into
> execExpr.c. I suppose you still need the LastAttnumInfo API change
> so as to decouple it from ProjectionInfo, but that's minor.

I think it's quite useful in other places too, we do repeated
slot-getattrs in a bunch of places in the executor, and it's noticeable
performance wise.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes: > On March 15, 2017 12:33:28 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> So for my money, you should drop 0002 altogether and just have 0004 >> remove get_last_attnums() from execUtils.c and stick it into >> execExpr.c. I suppose you still need the LastAttnumInfo API change >> so as to decouple it from ProjectionInfo, but that's minor. > I think it's quite useful in other places too, we do repeated slot-getattrs in a bunch of places in the executor, and it'snoticeable performance wise. Color me dubious. Which specific other places have you got in mind, and do they have expression trees at hand that would tell them which columns they really need to pull out? regards, tom lane
On 2017-03-15 15:41:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On March 15, 2017 12:33:28 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> So for my money, you should drop 0002 altogether and just have 0004
>>> remove get_last_attnums() from execUtils.c and stick it into
>>> execExpr.c. I suppose you still need the LastAttnumInfo API change
>>> so as to decouple it from ProjectionInfo, but that's minor.
>
>> I think it's quite useful in other places too, we do repeated
>> slot-getattrs in a bunch of places in the executor, and it's
>> noticeable performance wise.
>
> Color me dubious. Which specific other places have you got in mind, and
> do they have expression trees at hand that would tell them which columns
> they really need to pull out?

I was thinking of execGrouping.c's execTuplesMatch(),
TupleHashTableHash() (and unequal, but doubt that matters
performancewise). There's also nodeHash.c's ExecHashGetHashValue(),
but I think that'd possibly better fixed differently.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-15 15:41:22 -0400, Tom Lane wrote:
>> Color me dubious. Which specific other places have you got in mind, and
>> do they have expression trees at hand that would tell them which columns
>> they really need to pull out?
>
> I was thinking of execGrouping.c's execTuplesMatch(),
> TupleHashTableHash() (and unequal, but doubt that matters
> performancewise). There's also nodeHash.c's ExecHashGetHashValue(),
> but I think that'd possibly better fixed differently.

The execGrouping.c functions don't have access to an expression tree
instructing them which columns to pull out of the tuple, so I fail to
see how get_last_attnums() would be of any use to them.

As for ExecHashGetHashValue, it's most likely going to be working from
virtual tuples passed up to the join, which won't benefit from
predetermination of the last column to be accessed. The
tuple-deconstruction would have happened while projecting in the scan
node below.

			regards, tom lane
On 2017-03-15 16:07:14 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2017-03-15 15:41:22 -0400, Tom Lane wrote:
>>> Color me dubious. Which specific other places have you got in mind, and
>>> do they have expression trees at hand that would tell them which columns
>>> they really need to pull out?
>
>> I was thinking of execGrouping.c's execTuplesMatch(),
>> TupleHashTableHash() (and unequal, but doubt that matters
>> performancewise). There's also nodeHash.c's ExecHashGetHashValue(),
>> but I think that'd possibly better fixed differently.
>
> The execGrouping.c functions don't have access to an expression tree
> instructing them which columns to pull out of the tuple, so I fail to
> see how get_last_attnums() would be of any use to them.

I presume most of the callers do. We'd have to change the API somewhat,
unless we just have a small loop in execTuplesMatch() determining the
biggest column index (which might be worthwhile / acceptable).
TupleHashTableHash() should be able to have that pre-computed in
BuildTupleHashTable(). Might be more viable to go that way.

> As for ExecHashGetHashValue, it's most likely going to be working from
> virtual tuples passed up to the join, which won't benefit from
> predetermination of the last column to be accessed. The
> tuple-deconstruction would have happened while projecting in the scan
> node below.

I think the physical tuple stuff commonly thwarts that argument? On
master for tpch's Q5 you can e.g. see the following profile (master):

+   29.38%  postgres  postgres  [.] ExecScanHashBucket
+   16.72%  postgres  postgres  [.] slot_getattr
+    5.51%  postgres  postgres  [.] heap_getnext
-    5.50%  postgres  postgres  [.] slot_deform_tuple
   - 98.07% slot_deform_tuple
      - 85.98% slot_getattr
         - 96.59% ExecHashGetHashValue
            - ExecHashJoin
               - ExecProcNode
                  + 85.12% ExecHashJoin
                  + 14.88% MultiExecHash
         + 3.41% ExecMakeFunctionResultNoSets
      + 14.02% slot_getsomeattrs
   + 1.58% ExecEvalScalarVarFast

I.e. nearly all calls to slot_deform_tuple are from slot_getattr in
ExecHashGetHashValue(). And nearly all the time in slot_getattr is
spent on code only executed for actual tuples:

       │     if (tuple == NULL)            /* internal error */
  0.18 │       test   %rax,%rax
       │     ↓ je     223
       │      *
       │      * (We have to check this separately because of various inheritance and
       │      * table-alteration scenarios: the tuple could be either longer or shorter
       │      * than the tupdesc.)
       │      */
       │     tup = tuple->t_data;
  0.47 │       mov    0x10(%rax),%rsi
       │     if (attnum > HeapTupleHeaderGetNatts(tup))
 75.42 │       movzwl 0x12(%rsi),%eax
  0.70 │       and    $0x7ff,%eax
  0.47 │       cmp    %eax,%ebx
       │     ↓ jg     e8

- Andres
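To illustrate the alternative being floated for execGrouping.c - precompute the highest needed column once (e.g. in BuildTupleHashTable()) and deform in one go - here is a sketch; fetch_grouping_cols and last_attnum are made-up names for illustration, while slot_getsomeattrs() and keyColIdx follow the existing executor conventions:

	/*
	 * Sketch: deform the tuple once, up to the highest attribute any of
	 * the grouping columns needs, instead of letting each slot_getattr()
	 * call potentially re-walk the physical tuple.
	 */
	static void
	fetch_grouping_cols(TupleTableSlot *slot, int numCols,
						AttrNumber *keyColIdx, int last_attnum,
						Datum *values, bool *isnull)
	{
		int			i;

		/* one pass over the tuple, filling tts_values/tts_isnull */
		slot_getsomeattrs(slot, last_attnum);

		for (i = 0; i < numCols; i++)
		{
			int			att = keyColIdx[i] - 1;	/* attnums are 1-based */

			values[i] = slot->tts_values[att];
			isnull[i] = slot->tts_isnull[att];
		}
	}

This is exactly the trade the profile above motivates: the expensive per-call path in slot_getattr (the HeapTupleHeaderGetNatts check and friends) runs once per tuple rather than once per column.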
Andres Freund <andres@anarazel.de> writes: > On 2017-03-15 16:07:14 -0400, Tom Lane wrote: >> As for ExecHashGetHashValue, it's most likely going to be working from >> virtual tuples passed up to the join, which won't benefit from >> predetermination of the last column to be accessed. The >> tuple-deconstruction would have happened while projecting in the scan >> node below. > I think the physical tuple stuff commonly thwarts that argument? On > master for tpch's Q5 you can e.g. see the following profile (master): Hmmm ... I think you're mistaken in fingering the physical-tuple optimization per se, but maybe skipping ExecProject at the scan level would cause this result? I've thought for some time that it was dumb to have the executor reverse-engineering this info at plan startup anyway. We could make the planner mark each table scan node with the highest column number that the plan will access, and use that to drive a slot_getsomeattrs call in advance of any access to tuple contents. regards, tom lane
On 2017-03-15 17:33:46 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-03-15 16:07:14 -0400, Tom Lane wrote: > >> As for ExecHashGetHashValue, it's most likely going to be working from > >> virtual tuples passed up to the join, which won't benefit from > >> predetermination of the last column to be accessed. The > >> tuple-deconstruction would have happened while projecting in the scan > >> node below. > > > I think the physical tuple stuff commonly thwarts that argument? On > > master for tpch's Q5 you can e.g. see the following profile (master): > > Hmmm ... I think you're mistaken in fingering the physical-tuple > optimization per se, but maybe skipping ExecProject at the scan level > would cause this result? I think those are often related (i.e. we replace a smaller targetlist with a "physical" one, which then allows skipping ExecProject()). > I've thought for some time that it was dumb to have the executor > reverse-engineering this info at plan startup anyway. Yea, it'd be good if this (and some similar tasks like building interim tuple descriptors) could be moved to the planner. But: > We could make the planner mark each table scan node with the highest > column number that the plan will access, and use that to drive a > slot_getsomeattrs call in advance of any access to tuple contents. probably isn't sufficient - we build non-virtual tuples in a good number of places (sorts, tuplestore using stuff like nodeMaterial, nodeHash.c output, ...). I suspect it'd have measurable negative consequences if we removed the deforming logic for all expressions/projections above such nodes. I guess we could just do such a logic for every Plan node? - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-03-15 17:33:46 -0400, Tom Lane wrote: >> We could make the planner mark each table scan node with the highest >> column number that the plan will access, and use that to drive a >> slot_getsomeattrs call in advance of any access to tuple contents. > probably isn't sufficient - we build non-virtual tuples in a good number > of places (sorts, tuplestore using stuff like nodeMaterial, nodeHash.c > output, ...). I suspect it'd have measurable negative consequences if > we removed the deforming logic for all expressions/projections above > such nodes. I guess we could just do such a logic for every Plan node? [ scratches head... ] What deforming logic do you think I'm proposing removing? regards, tom lane
On 2017-03-15 18:16:57 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-03-15 17:33:46 -0400, Tom Lane wrote: > >> We could make the planner mark each table scan node with the highest > >> column number that the plan will access, and use that to drive a > >> slot_getsomeattrs call in advance of any access to tuple contents. > > > probably isn't sufficient - we build non-virtual tuples in a good number > > of places (sorts, tuplestore using stuff like nodeMaterial, nodeHash.c > > output, ...). I suspect it'd have measurable negative consequences if > > we removed the deforming logic for all expressions/projections above > > such nodes. I guess we could just do such a logic for every Plan node? > > [ scratches head... ] What deforming logic do you think I'm proposing > removing? I thought you were suggesting that we don't do the get_last_attnums (and inlined version in the isSimpleVar case) at execution time anymore, instead relying on logic in the planner to know how much to deform ahead of time. Then we'd do slot_getsomeattrs in the appropriate places. But I understood you suggesting to do so only in scan nodes - which doesn't seem sufficient, due to the use of materialized / minimal tuples in other types of nodes. Did I misunderstand? - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-03-15 18:16:57 -0400, Tom Lane wrote: >> [ scratches head... ] What deforming logic do you think I'm proposing >> removing? > I thought you were suggesting that we don't do the get_last_attnums (and > inlined version in the isSimpleVar case) at execution time anymore, > instead relying on logic in the planner to know how much to deform ahead > of time. Then we'd do slot_getsomeattrs in the appropriate places. But > I understood you suggesting to do so only in scan nodes - which doesn't > seem sufficient, due to the use of materialized / minimal tuples in > other types of nodes. Did I misunderstand? We would need to do it anywhere that we might be projecting from a materialized tuple, I suppose. From the planner's standpoint it would be about as easy to do this for all plan nodes as only selected ones. Anyway the core point here seems to be that skipping ExecProject misses a bet because the underlying tuple doesn't get disassembled. A quick-hack way of seeing if this helps might be to do slot_getallattrs in the places where we skip ExecProject. I'm not sure we'd want that as a permanent solution, because it's possible that it extracts columns not actually needed, but it should at least move the hotspot in your example case. regards, tom lane
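As a concrete illustration, the quick hack might amount to something like this at each site that skips projection (a hypothetical sketch, not the patch's code; the ps_ProjInfo test stands in for however the skip is detected, and slot_getallattrs() is the existing helper):

    /*
     * Hypothetical sketch of the quick hack: when the physical-tuple
     * optimization leaves us without an ExecProject call, deform the
     * whole tuple eagerly, so later slot_getattr() calls never have to
     * touch the on-disk tuple.
     */
    if (node->ps.ps_ProjInfo == NULL)
        slot_getallattrs(slot);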
Andres Freund <andres@anarazel.de> writes: > [ new patches ] I've started to look at 0004, and the first conclusion I've come to is that it's *direly* short of documentation. To the point that I'd vote against committing it if something isn't done about that. As an example, it's quite unclear how ExprEvalSteps acquire correct resnull/resvalue pointers, and the algorithm for that seems nontrivial. It doesn't help any that the arguments of ExecInitExprRec are entirely undocumented. I think it would be worth creating a README file giving an overview of how all of this patch is supposed to work. You also need to do a whole lot more work on the function-level comments. A specific thing I noticed in the particular area of what-gets-returned-where is this bit in EEOP_SCALARARRAYOP setup:

+        /*
+         * Evaluate array argument into our return value, overwrite
+         * with comparison results afterwards.
+         */
+        ExecInitExprRec((Expr *) lsecond(opexpr->args), parent, state,
+                        resv, resnull);

That scares me quite a bit, because it smells exactly like the sort of premature optimization that bit us on the rear in CVE-2016-5423 (cf commit f0c7b789a). What's it really buying us to overwrite the return value early rather than storing into the fcinfo's second argument slot? (The memory of that CVE is part of what's prompting me to demand a clear explanation of the algorithm for deciding where steps return their results. Even if this particular code is safe, somebody is going to do something unsafe in future if there's not a documented rule to follow.) Another thing that ties into the do-I-understand-this-at-all question is this bit:

+        EEO_CASE(EEOP_BOOL_AND_STEP_FIRST)
+        {
+            *op->d.boolexpr.anynull = false;
+
+            /*
+             * Fallthrough (can't be last - ANDs have two arguments at least).
+             */
+        }
+
+        EEO_CASE(EEOP_BOOL_AND_STEP)

It seems like this is missing an "op++;" before falling through. If it isn't, because really BOOL_AND_STEP_FIRST is defined as clearing anynull and then also doing a regular BOOL_AND_STEP, then the comment seems rather off point. It should be more like "Fall through to do regular AND step processing as well". The place where the comment would be on point is probably over here:

+            case AND_EXPR:
+                /*
+                 * ANDs have at least two arguments, so that
+                 * no step needs to be both FIRST and LAST.
+                 */
+                Assert(list_length(boolexpr->args) >= 2);
+
+                if (off == 0)
+                    scratch.opcode = EEOP_BOOL_AND_STEP_FIRST;
+                else if (off + 1 == nargs)
+                    scratch.opcode = EEOP_BOOL_AND_STEP_LAST;
+                else
+                    scratch.opcode = EEOP_BOOL_AND_STEP;
+                break;

although I think the Assert ought to be examining nargs not list_length(boolexpr->args) so that it has some visible connection to the code after it. (This all applies to OR processing as well, of course.) BTW, it sure seems like ExecInitExprRec and related code ought to set off all sorts of valgrind complaints? It's not being careful at all to ensure that all fields of the "scratch" record get initialized before we memcpy it to someplace. regards, tom lane
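(For readers without the patch at hand: the EEO_CASE machinery under review is a threaded, opcode-dispatching interpreter. A self-contained toy version of the technique - all names invented for the demo, using GCC/Clang computed gotos, not the patch's code - looks like this:)

    #include <stdio.h>

    typedef enum { OP_CONST, OP_ADD, OP_DONE } OpCode;

    typedef struct { OpCode op; int arg; } Step;

    static int
    execute(const Step *steps)
    {
        /* one dispatch-table entry per opcode, in enum order */
        static const void *dispatch[] = {
            &&case_const, &&case_add, &&case_done
        };
        const Step *ip = steps;
        int         acc = 0;

    #define NEXT() goto *dispatch[(++ip)->op]

        goto *dispatch[ip->op];

    case_const:
        acc = ip->arg;
        NEXT();
    case_add:
        acc += ip->arg;
        NEXT();
    case_done:
        return acc;
    }

    int
    main(void)
    {
        Step program[] = { {OP_CONST, 40}, {OP_ADD, 2}, {OP_DONE, 0} };

        printf("%d\n", execute(program));   /* prints 42 */
        return 0;
    }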
On 2017-03-15 18:48:28 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-03-15 18:16:57 -0400, Tom Lane wrote: > >> [ scratches head... ] What deforming logic do you think I'm proposing > >> removing? > > > I thought you were suggesting that we don't do the get_last_attnums (and > > inlined version in the isSimpleVar case) at execution time anymore, > > instead relying on logic in the planner to know how much to deform ahead > > of time. Then we'd do slot_getsomeattrs in the appropriate places. But > > I understood you suggesting to do so only in scan nodes - which doesn't > > seem sufficient, due to the use of materialized / minimal tuples in > > other types of nodes. Did I misunderstand? > > We would need to do it anywhere that we might be projecting from a > materialized tuple, I suppose. From the planner's standpoint it would be > about as easy to do this for all plan nodes as only selected ones. Yea. > Anyway the core point here seems to be that skipping ExecProject misses a > bet because the underlying tuple doesn't get disassembled. A quick-hack > way of seeing if this helps might be to do slot_getallattrs in the places > where we skip ExecProject. > I'm not sure we'd want that as a permanent > solution, because it's possible that it extracts columns not actually > needed, but it should at least move the hotspot in your example case. Hm, that could hurt pretty badly if we actually only access the first few columns. I wonder if we shouldn't essentially get rid of all slot_getattr() calls, and replace them with one slot_getsomeattrs() and then direct tts_values/nulls access. Where we compute the column number is then essentially a separate discussion. Looking around, most of them seem to access multiple columns, and several of them probably can't conveniently be done via the planner:

src/backend/catalog/partition.c
1635:  datum = slot_getattr(slot, keycol, &isNull);
Accesses all the partitioned columns and has a convenient location to compute the column to deform to (RelationGetPartitionDispatchInfo).

src/backend/catalog/index.c
1797:  iDatum = slot_getattr(slot, keycol, &isNull);
Accesses all the indexed columns and has a convenient location to compute the column to deform to (BuildIndexInfo).

src/backend/utils/adt/orderedsetaggs.c
1196:  Datum d = slot_getattr(slot, nargs + 1, &isnull);
1359:  Datum d = slot_getattr(slot, nargs + 1, &isnull);
No benefit.

src/backend/executor/nodeMergeAppend.c
246:   datum1 = slot_getattr(s1, attno, &isNull1);
247:   datum2 = slot_getattr(s2, attno, &isNull2);
Accesses all columns in the sort key, convenient place to prepare (ExecInitMergeAppend).

src/backend/executor/execQual.c
594:   * caught inside slot_getattr). What we have to check for here is the
600:   * Note: we allow a reference to a dropped attribute. slot_getattr will
637:   return slot_getattr(slot, attnum, isNull);
676:   return slot_getattr(slot, attnum, isNull);
1681:  return slot_getattr(fcache->funcResultSlot, 1, isNull);
Already converted in the proposed expression evaluation patch.

src/backend/executor/nodeSubplan.c
367:   dvalue = slot_getattr(slot, 1, &disnull);
395:   prmdata->value = slot_getattr(slot, col, &(prmdata->isnull));
565:   prmdata->value = slot_getattr(slot, col,
1015:  dvalue = slot_getattr(slot, 1, &disnull);
Accesses all params, convenient location to compute column number (ExecInitSubPlan).
src/backend/executor/execCurrent.c
186:   /* Use slot_getattr to catch any possible mistakes */
188:   DatumGetObjectId(slot_getattr(scanstate->ss_ScanTupleSlot,
193:   DatumGetPointer(slot_getattr(scanstate->ss_ScanTupleSlot,
Accesses only system columns.

src/backend/executor/nodeSetOp.c
107:   flag = DatumGetInt32(slot_getattr(inputslot,
Not an actual column.

src/backend/executor/nodeGatherMerge.c
678:   datum1 = slot_getattr(s1, attno, &isNull1);
679:   datum2 = slot_getattr(s2, attno, &isNull2);
Same as mergeAppend (huh, why did we copy this code), i.e. beneficial.

src/backend/executor/functions.c
970:   value = slot_getattr(slot, 1, &(fcinfo->isnull));
No benefit.

src/backend/executor/execJunk.c
253:   return slot_getattr(slot, attno, isNull);
No benefit.

src/backend/executor/nodeNestloop.c
136:   prm->value = slot_getattr(outerTupleSlot,
Accesses all future param values, convenient place to compute column number.

src/backend/executor/execGrouping.c
100:   attr1 = slot_getattr(slot1, att, &isNull1);
102:   attr2 = slot_getattr(slot2, att, &isNull2);
170:   attr1 = slot_getattr(slot1, att, &isNull1);
175:   attr2 = slot_getattr(slot2, att, &isNull2);
501:   attr = slot_getattr(slot, att, &isNull);
Accesses all grouped-on values, but only some callers have a convenient place to compute the column number.

src/backend/access/common/printtup.c
545:   attr = slot_getattr(slot, i + 1, &isnull);
Debug routine.

contrib/postgres_fdw/postgres_fdw.c
3213:  value = slot_getattr(slot, attnum, &isnull);
Computes all parameter values, convenient-ish place to compute column number.

- Andres
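The proposed replacement pattern would look roughly like this at each such call site (a hypothetical sketch; maxatt and keycols stand in for state that would be computed once at node initialization):

    int     i;

    /* deform once, up to the largest attribute any key needs */
    slot_getsomeattrs(slot, maxatt);

    for (i = 0; i < nkeys; i++)
    {
        int     att = keycols[i];       /* 1-based attribute number */
        Datum   value = slot->tts_values[att - 1];
        bool    isnull = slot->tts_isnull[att - 1];

        /* ... hash or compare value, as the caller did before ... */
    }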
Hi, On 2017-03-15 20:09:03 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > [ new patches ] > > I've started to look at 0004, and the first conclusion I've come to > is that it's *direly* short of documentation. To the point that I'd > vote against committing it if something isn't done about that. Yea, I asked for input about what's hard to understand and what's not - I stared at this for a *lot* of time, and it all kinda looks easy-ish now. I'm more than willing to expand on the below, and other pieces. > As an example, it's quite unclear how ExprEvalSteps acquire correct > resnull/resvalue pointers, and the algorithm for that seems nontrivial. > It doesn't help any that the arguments of ExecInitExprRec are entirely > undocumented. Generally whatever wants the result of a (sub-)expression passes in the desired resvalue/resnull. E.g. when doing a function call the individual arguments are each prepared for evaluation using ExecInitExprRec() and resvalue/resnull are pointing into fcinfo's arg/nulls[i]. > I think it would be worth creating a README file giving an overview > of how all of this patch is supposed to work. You also need to do a > whole lot more work on the function-level comments. Ok. > A specific thing I noticed in the particular area of > what-gets-returned-where is this bit in EEOP_SCALARARRAYOP setup:
>
> +        /*
> +         * Evaluate array argument into our return value, overwrite
> +         * with comparison results afterwards.
> +         */
> +        ExecInitExprRec((Expr *) lsecond(opexpr->args), parent, state,
> +                        resv, resnull);
>
> That scares me quite a bit, because it smells exactly like the sort of > premature optimization that bit us on the rear in CVE-2016-5423 (cf commit > f0c7b789a). I don't think there's a danger similar to f0c7b789a here, because the "caller" (i.e. the node that needs the expression's result) expects resvalue/null to be overwritten. It'll e.g. be the value "slot" of one arm (is there a better name for one part of a boolean expression?) of a boolean expression. > What's it really buying us to overwrite the return value > early rather than storing into the fcinfo's second argument slot? That'd work just as well. > (The memory of that CVE is part of what's prompting me to demand a clear > explanation of the algorithm for deciding where steps return their > results. Even if this particular code is safe, somebody is going to do > something unsafe in future if there's not a documented rule to follow.) I don't think there's a danger here, but I think you more generally have a point. > Another thing that ties into the do-I-understand-this-at-all question > is this bit:
>
> +        EEO_CASE(EEOP_BOOL_AND_STEP_FIRST)
> +        {
> +            *op->d.boolexpr.anynull = false;
> +
> +            /*
> +             * Fallthrough (can't be last - ANDs have two arguments at least).
> +             */
> +        }
> +
> +        EEO_CASE(EEOP_BOOL_AND_STEP)
>
> It seems like this is missing an "op++;" before falling through. If it > isn't, because really BOOL_AND_STEP_FIRST is defined as clearing anynull > and then also doing a regular BOOL_AND_STEP, then the comment seems rather > off point. It's intended to fall through this way, i.e. the difference between _FIRST and not is just that only the former clears anynull. What the comment is about, admittedly too cryptically, is that the _LAST step that then evaluates anynull cannot be the same step as EEOP_BOOL_AND_STEP_FIRST, because bool AND/OR always has at least two "arms". Will expand / move.
> BTW, it sure seems like ExecInitExprRec and related code ought to set > off all sorts of valgrind complaints? It's not being careful at all > to ensure that all fields of the "scratch" record get initialized before > we memcpy it to someplace. It worked not long ago - valgrind's replacement memcpy() doesn't trigger undefined memory warnings, it just copies the "definedness" of each byte (or bit?). But your point gives me an idea: It seems like a good idea to VALGRIND_MAKE_MEM_UNDEFINED() the "scratch" step at some convenient places, so the definedness of individual operations is more useful. Thanks for the look! - Andres
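Concretely, that idea might look like this in the expression-compile path (a hypothetical sketch; VALGRIND_MAKE_MEM_UNDEFINED is the existing wrapper from utils/memdebug.h, the surrounding names are illustrative):

    ExprEvalStep scratch;

    /*
     * Poison the scratch step before filling it in for each opcode, so
     * valgrind flags any field that later gets copied into the steps
     * array without having been set.
     */
    VALGRIND_MAKE_MEM_UNDEFINED(&scratch, sizeof(scratch));

    scratch.opcode = EEOP_CONST;
    scratch.d.constval.value = con->constvalue;
    scratch.d.constval.isnull = con->constisnull;
    /* ... the memcpy of scratch into the steps array happens afterwards ... */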
Hi, I got a test failure with this version of the patch in postgres_fdw. It looks to me like it was caused by a typo in the source code, which is fixed in the attached patch. After applying this patch check-world passes. Andreas
Andres Freund <andres@anarazel.de> writes: > On 2017-03-15 20:09:03 -0400, Tom Lane wrote: >> That scares me quite a bit, because it smells exactly like the sort of >> premature optimization that bit us on the rear in CVE-2016-5423 (cf commit >> f0c7b789a). > I don't think there's a danger similar to f0c7b789a here, because the > "caller" (i.e. the node that needs the expression's result) expects > resvalue/null to be overwritten. Yeah, that's what I thought when I wrote the broken code in ExecEvalCase, too. It was wrong. Basically you've got to be sure that no aliasing can occur, and I think the only way to be safe about that is to have a very clear rule about where results are allowed to get returned to, preferably one that doesn't ever re-use the same target. (I think the repeated use of the same subexpression result address for the arms of an AND or OR is okay, but it would be a good idea to have a clear statement of why.) The thing that actually made the ExecEvalCase code into a bug was that we were using ExprContext-level fields to store the current caseValue, allowing aliasing to occur across nested CASEs. I think that in this implementation, it ought to be possible to get rid of ExprContext.caseValue_datum et al altogether, in favor of some storage location that's private to each CASE expression. I'm a bit disappointed that that doesn't seem to have happened. Eventually, I would also like to find a way to remove the restriction imposed by the other part of f0c7b789a, ie that we can't inline a SQL function when that would result in intermixing two levels of CASE expression. An implementation along the lines of what I've sketched above could handle that easily enough, as long as we could identify which nested level of CASE a particular CaseTestExpr belongs to. I don't know how to do that offhand, but it seems like it ought to be soluble if we put a bit of thought into it. regards, tom lane
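A minimal sketch of that private-storage idea (all names invented here, not the patch's code): the compile step for a CASE allocates a value/isnull pair owned by that CASE alone, and the contained CaseTestExprs are compiled to read from exactly those addresses, so nested CASEs cannot alias each other.

    Datum  *caseval = palloc(sizeof(Datum));
    bool   *casenull = palloc(sizeof(bool));

    /* evaluate the CASE's test expression into the private pair */
    ExecInitExprRec(caseExpr->arg, parent, state, caseval, casenull);

    /*
     * ... and point the CaseTestExpr steps inside the WHEN clauses at
     * caseval/casenull, instead of at ExprContext-level fields ...
     */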
I wrote: > Andres Freund <andres@anarazel.de> writes: >> I don't think there's a danger similar to f0c7b789a here, because the >> "caller" (i.e. the node that needs the expression's result) expects >> resvalue/null to be overwritten. > Yeah, that's what I thought when I wrote the broken code in ExecEvalCase, > too. It was wrong. Along the same line, I notice that you've got some expr step types overwriting their own input, the various flavors of EEOP_BOOLTEST for example. Maybe that's all right but it doesn't really give me a warm feeling, especially when other single-argument operations like EEOP_BOOL_NOT_STEP are done differently. Again, I think a clear explanation of the design is essential to allow people to reason about whether this sort of trickery is safe. regards, tom lane
Andres Freund <andres@anarazel.de> writes: > [ latest patches ] I looked through 0001 (the composite-type-dependencies one). Although I agree that it'd be good to tighten things up in that area, I do not think we want this specific patch: it tightens things too much. Consider this variant of the existing test case in create_view.sql:

    create table tt (f1 int, f2 int, f3 int);
    create function ttf() returns setof tt as 'select * from tt' language sql;
    create view vv as select f3 from ttf();
    alter table tt drop column f3;

Current code allows the above (and causes the view to return nulls afterwards). The 0001 patch would forbid the DROP, which is fine, but it would also forbid dropping either of the other two table columns, which I think is insupportable. Also, given the above table and function, consider

    create view vvv as select ttf();

This view returns a 3-column composite type. Now do

    alter table tt add column f4 int;

Now the view returns a 4-column composite type. But at this point the patch will let you drop the f4 column, but not any of the earlier three. That's just weird. So I'm unhappy with the specific decisions made in 0001. I think what we really want there, probably, is for find_expr_references_walker to do more than nothing with Vars referencing non-RELATION RTEs. Having said all that, I think that 0001 is contributing very little to the goals of this patch set. Andres stated that he wanted it so as to drop some of the one-time checks that execQual.c currently does for Vars, but I'm not really convinced that we could do that safely even with these additional dependencies in place. Moreover, I really doubt that there's a horrible performance cost from including something like

    if (unlikely(op->first_execution))
        out_of_line_checking_subroutine(...);

in the execution code for Vars. And that certainly isn't adding any complexity for JIT compilation that we don't face anyway for other execution step types. So my recommendation is to drop 0001 and include the same one-time checks that execQual.c currently has as out-of-line one-time checks in the new code. We can revisit that later, but time grows short for v10. I would much rather have a solid version of 0004 and not 0001, than not have anything for v10 because we spent too much time dealing with adding new dependencies. regards, tom lane
Hi, On 2017-03-17 11:36:30 -0400, Tom Lane wrote: > Having said all that, I think that 0001 is contributing very little to the > goals of this patch set. Andres stated that he wanted it so as to drop > some of the one-time checks that execQual.c currently does for Vars, but > I'm not really convinced that we could do that safely even with these > additional dependencies in place. Moreover, I really doubt that there's > a horrible performance cost from including something like
>
>     if (unlikely(op->first_execution))
>         out_of_line_checking_subroutine(...);
>
> in the execution code for Vars. But it actually does (well, for some relatively small value of horrible). The issue is that op->first_execution is actually badly predictable, because it will be set back/forth between executions of different expressions (in the same plantree). Obviously you'll not notice if you have a Var and then some expensive stuff, but it's noticeable for cheap-ish expressions (say e.g. a single Var). So the branch prediction often doesn't handle this gracefully - it also just expands the set of to-be-tracked jumps. If we had a decent way to actually check this during ExecInitExpr() (et al), the whole discussion would be different - I'd be all for expanding the set of such checks even. But I couldn't find a decent way to get there - when expressions are initialized we don't even get an ExprContext (not to speak of valid slots), nor is parent->plan very helpful. That said, it seems this is something that has to wait for a later release; I'm putting back in similar logic as there was before (not a branch, but changing the opcode to a non-checking variant). > And that certainly isn't adding any > complexity for JIT compilation that we don't face anyway for other > execution step types. Obviously an if (op->first_execution) isn't an issue; it's doing something only the first time through that's not easily possible. > So my recommendation is to drop 0001 and include the same one-time > checks that execQual.c currently has as out-of-line one-time checks > in the new code. We can revisit that later, but time grows short for > v10. I would much rather have a solid version of 0004 and not 0001, > than not have anything for v10 because we spent too much time dealing > with adding new dependencies. Doing that (+README). Greetings, Andres Freund
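The opcode-changing variant could look roughly like this in the interpreter (a sketch in the patch's style; the exact names are guesses): the checking step runs once, validates the slot, then rewrites its own opcode so every later execution takes the branch-free path.

    EEO_CASE(EEOP_INNER_VAR_FIRST)
    {
        /* first execution only: check that the slot still matches the Var */
        CheckVarSlotCompatibility(innerslot,
                                  op->d.var.attnum + 1,
                                  op->d.var.vartype);

        /* rewrite own opcode, so later executions skip the check */
        op->opcode = EEO_OPCODE(EEOP_INNER_VAR);

        /* FALL THROUGH to the non-checking variant */
    }

    EEO_CASE(EEOP_INNER_VAR)
    {
        *op->resvalue = innerslot->tts_values[op->d.var.attnum];
        *op->resnull = innerslot->tts_isnull[op->d.var.attnum];
        EEO_NEXT();
    }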
Andres Freund <andres@anarazel.de> writes: > That said, it seems this is something that has to wait for a later > release, I'm putting back in similar logic as there was before (not a > branch, but change the opcode to a non-checking variant). Yeah, I was wondering if changing the opcode would be preferable to a first-time flag. >> So my recommendation is to drop 0001 and include the same one-time >> checks that execQual.c currently has as out-of-line one-time checks >> in the new code. We can revisit that later, but time grows short for >> v10. I would much rather have a solid version of 0004 and not 0001, >> than not have anything for v10 because we spent too much time dealing >> with adding new dependencies. > Doing that (+README). OK. I believe that we can get this committed after the documentation problems are sorted. I noticed a lot of small things that bugged me, mostly sloppy comments, but I think that the most efficient way to handle those is for me to make an editorial pass on your next version. regards, tom lane
Hi, On 2017-03-16 11:15:16 -0400, Tom Lane wrote: > The thing that actually made the ExecEvalCase code into a bug was that > we were using ExprContext-level fields to store the current caseValue, > allowing aliasing to occur across nested CASEs. I think that in > this implementation, it ought to be possible to get rid of > ExprContext.caseValue_datum et al altogether, in favor of some storage > location that's private to each CASE expression. I'm a bit disappointed > that that doesn't seem to have happened. The patch actually does so - during ExecInitExprRec the "relevant" case/domain testval is stored in ExprState->innermost_*; those pointers are then stored directly in the relevant steps for CaseTest/CoerceToDomainValue evaluation. Unfortunately CaseTest/CoerceToDomainValue are reused outside of domain / case expressions in a bunch of places (plpgsql uses CaseTest for cast evaluation, validateDomainConstraint/domain_check_input evaluate domain constraints without a CoerceToDomain node). I.e. ExprContext.caseValue_datum etc. aren't used for normal expressions anymore, just for the ad-hoc hackery in a bunch of places. I'd like to get rid of those usages, but that'd recurse into rewriting plpgsql casts and other random pieces of code into a different approach - something I'd like to avoid doing at the same time as this already large patch. I've been pondering if we can't entirely get rid of CaseTest etc; the amount of hackery required doesn't seem like a good thing. One way I'd prototyped was to replace them with PARAM_EXEC nodes - then the whole issue of them potentially having different values at different parts of an expression vanishes, because the aliasing is removed. > Eventually, I would also like to find a way to remove the restriction > imposed by the other part of f0c7b789a, ie that we can't inline a SQL > function when that would result in intermixing two levels of CASE > expression. An implementation along the lines of what I've sketched > above could handle that easily enough, as long as we could identify > which nested level of CASE a particular CaseTestExpr belongs to. > I don't know how to do that offhand, but it seems like it ought to be > soluble if we put a bit of thought into it. I haven't thought overly much about this, but I agree, it looks like it should be doable. That seems to suggest my PARAM_EXEC idea isn't necessarily perfect - inlining would cause aliasing again, but it'd also not be hard to fix that up. - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-03-16 11:15:16 -0400, Tom Lane wrote: >> The thing that actually made the ExecEvalCase code into a bug was that >> we were using ExprContext-level fields to store the current caseValue, >> allowing aliasing to occur across nested CASEs. I think that in >> this implementation, it ought to be possible to get rid of >> ExprContext.caseValue_datum et al altogether, in favor of some storage >> location that's private to each CASE expression. I'm a bit disappointed >> that that doesn't seem to have happened. > ... Unfortunately > CaseTest/CoerceToDomainValue are reused outside of domain / case > expressions in a bunch of places (plpgsql uses CaseTest for casts > evaluation, validateDomainConstraint/domain_check_input evaluate domain > constraints without a CoerceToDomain node). Yeah, there's various stuff that did that for expediency. > I'd like to get rid of those usages, but that'd recurse into rewriting > plpgsql casts and other random pieces of code into a different approach > - something I'd like to avoid doing at the same as this already large > patch. Agreed, maybe we should just plan to clean that up later. > I've been pondering if we can't entirely get rid of CaseTest etc, the > amount of hackery required seems not like a good thing. One way I'd > prototyped was to replace them with PARAM_EXEC nodes - then the whole > issue of them potentially having different values at different parts of > an expression vanishes because the aliasing is removed. Yes, replacing all of that with Param slots had occurred to me too. We might want to keep the special parse node types for convenience in reverse-listing, but having them act just like PARAM_EXEC for execution purposes seems promising. >> Eventually, I would also like to find a way to remove the restriction >> imposed by the other part of f0c7b789a, ie that we can't inline a SQL >> function when that would result in intermixing two levels of CASE >> expression. > That seems to suggest my PARAM_EXEC idea isn't necessarily perfect - > inlining would cause aliasing again, but it'd also not hard to fix that > up. Right, inlining would probably require some parameter-renumbering to avoid aliasing. But at least we'd have a clear framework for how to handle it, whereas the way things are now, it's just impossible. regards, tom lane
Hi, On 2017-03-19 23:55:50 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > I've been pondering if we can't entirely get rid of CaseTest etc; the > > amount of hackery required doesn't seem like a good thing. One way I'd > > prototyped was to replace them with PARAM_EXEC nodes - then the whole > > issue of them potentially having different values at different parts of > > an expression vanishes, because the aliasing is removed. > > Yes, replacing all of that with Param slots had occurred to me too. > We might want to keep the special parse node types for convenience in > reverse-listing, but having them act just like PARAM_EXEC for execution > purposes seems promising. As long as that special parse-time node is part of the same value numbering, that makes sense (we could just make it a subtype of Param, a la PARAM_CASE). I don't think we actually do anything useful in ruleutils etc with either CaseTest or CoerceToDomainValue. - Andres
Hi, On 2017-03-15 20:09:03 -0400, Tom Lane wrote: > I think it would be worth creating a README file giving an overview > of how all of this patch is supposed to work. You also need to do a > whole lot more work on the function-level comments. I tried to improve upon both fronts. I've added the higher level explanation to executor/README, but I don't feel very strongly about that. I'm not quite sure it's exactly what you wanted, however; the above ask could also be understood to have more of a motivational angle, describing why and what exactly is changed? I'm also still not sure how understandable it is for anybody that hasn't had their head in this for a while... I've also, as discussed nearby, re-added code akin to the checks in ExecEvalScalarVar, with the discussed adjustment of the opcodes for later executions. Now that I've done that, I'm not sure I like that approach that much - another alternative would be to change an ExprState's evalfunc to ExecCheckAndExecute() after initialization. That'd have the advantage of working nicely for JIT. Either way, that can trivially be changed later. Additionally I added a regression test for the nearly entirely untested nodeTidscan.c, after I'd broken it previously without noticing (thanks Andreas). I started a run through valgrind, without complaints up to create_function_2. Can't run all of it (and contrib) right now without prematurely running out of power on a plane. Did I understand correctly that you'd rather just merge ExecGetLastAttnums into execExpr.c, instead of making it globally available? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2017-03-15 20:09:03 -0400, Tom Lane wrote: >> I think it would be worth creating a README file giving an overview >> of how all of this patch is supposed to work. You also need to do a >> whole lot more work on the function-level comments. > I tried to improve upon both fronts. I've added the higher level > explanation to executor/README, but I don't feel very strong about that. > I'm not quite sure it's exactly what you wanted however, the above ask > could also be understood to have more of an motivational angle, > describing why and what exactly is changed? I'm also still not sure how > understandable it's for anybody that hasn't had their head in this for a > while... Well, I wasn't totally sure what was needed either. I'm coming to this relatively fresh, having paid little attention to the thread up to now, so maybe I'll try to add material to what you wrote as I figure things out. > Did I understand correctly that you'd rather just merge > ExecGetLastAttnums into execExpr.c, instead of making it globally > available? Yeah, just moving it over seems like the thing to do for now. We can expose it later if there proves to be an actual reason to do that, but as I mentioned, I'm doubtful that there will be one. I'll start reading these... regards, tom lane
Andres Freund <andres@anarazel.de> writes: > Additionally I added a regression test for the nearly entirely untested > nodeTidscan.c, after I'd broken it previously without noticing (thanks > Andreas). I went ahead and pushed this part, since it seemed pretty uncontroversial. I added a bit more stuff to get the LOC measurement up on the planner side too. regards, tom lane
... is there a reason why resultnum for EEOP_ASSIGN_* steps is declared size_t and not just int? Since it's an array index, and one that certainly can't be bigger than AttrNumber, that seems rather confusing. regards, tom lane
On 2017-03-20 16:06:27 -0400, Tom Lane wrote: > ... is there a reason why resultnum for EEOP_ASSIGN_* steps is declared > size_t and not just int? Since it's an array index, and one that > certainly can't be bigger than AttrNumber, that seems rather confusing. Not that I can see, no. I guess I might have "overcompensated" when changing it from AttrNumber - AttrNumber isn't a good idea because it needs an extra move/zero-extend, since 16bit indexing isn't that well supported on x86. But that doesn't mean it should be a 64bit number - to the contrary, actually. - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-03-20 16:06:27 -0400, Tom Lane wrote: >> ... is there a reason why resultnum for EEOP_ASSIGN_* steps is declared >> size_t and not just int? Since it's an array index, and one that >> certainly can't be bigger than AttrNumber, that seems rather confusing. > Not that I can see, no. I guess I might have "overcompensated" when > changing it from AttrNumber - AttrNumber isn't a good idea because that > needs an extra move-zero-extend, because 16bit indexing isn't that well > supported on x86. But that doesn't mean it should be a 64bit number - > to the contrary actually. OK, will fix in the edits I'm working on. regards, tom lane
I've been busily hacking away on this, trying to make things cleaner and fix a couple of bugs I stumbled across. (Confusion between ExecQual and ExecCheck, for instance - we apparently lack regression tests exercising constraints-returning-null in corner cases such as table rewrite.) It will probably be a day or two more before I'm done. A couple of naming questions:

* I concur with Heikki's dislike for the file name execInterpExpr.c. Would it make it any better to switch to execExprInterp.c? I think that having all this stuff under a common pattern execExpr*.c would be a good thing (and I notice you've already got one comment referring to them that way ...)

* execQual.c doesn't seem to have a unifying reason to exist anymore. It certainly has little to do with evaluating typical qual expressions; what's left in there seems to be mostly concerned with SRFs. I feel like it might be a good idea to rename it, but to what? execExprUtils.c perhaps? Or maybe we should destroy it altogether, shoving the SRF stuff into nodeFunctionscan.c and moving what little remains into execUtils.c.

* I do not like the function name ExecInstantiateExpr(). Webster's defines "instantiate" as "to represent (an abstraction) by a concrete instance", which does not seem to me to have a lot to do with what this function actually does. There's nothing very abstract about its input. More, the implication of "instantiate" is that you can instantiate any number of representatives of the same abstraction, but this scribbles on the input in a one-way fashion. I think perhaps something like ExecPrepareExpr or ExecFinishExpr or something along that line would be better, but nothing is really standing out as le mot juste.

Thoughts? regards, tom lane
On 2017-03-22 10:41:06 -0400, Tom Lane wrote: > I've been busily hacking away on this, trying to make things cleaner and > fix a couple of bugs I stumbled across. (Confusion between ExecQual and > ExecCheck, for instance - we apparently lack regression tests exercising > constraints-returning-null in corner cases such as table rewrite.) It > will probably be a day or two more before I'm done. Thanks! > A couple of naming questions: > > * I concur with Heikki's dislike for the file name execInterpExpr.c. > Would it make it any better to switch to execExprInterp.c? I think > that having all this stuff under a common pattern execExpr*.c would > be a good thing (and I notice you've already got one comment referring > to them that way ...) That works for me. > * execQual.c doesn't seem to have a unifying reason to exist anymore. > It certainly has little to do with evaluating typical qual expressions; > what's left in there seems to be mostly concerned with SRFs. I feel > like it might be a good idea to rename it, but to what? execExprUtils.c > perhaps? Or maybe we should destroy it altogether, shoving the SRF > stuff into nodeFunctionscan.c and moving what little remains into > execUtils.c. Yea, I was wondering about that too. What would we do with GetAttributeByName/Num? > * I do not like the function name ExecInstantiateExpr(). Webster's > defines "instantiate" as "to represent (an abstraction) by a concrete > instance", which does not seem to me to have a lot to do with what this > function actually does. It perhaps makes a *bit* more sense if you view it from the POV that, with the future JIT support (WIP versions of which I posted previously), it'd actually create a compiled function which'd largely be independent of ->steps (except for the non-hotpath functions, which'd still end up using it). So one of the above "concrete instances" would be the interpreted version, another the compiled one. No, not an entirely convincing argument. > There's nothing very abstract about its input. > More, the implication of "instantiate" is that you can instantiate any > number of representatives of the same abstraction, but this scribbles > on the input in a one-way fashion. I think perhaps something like > ExecPrepareExpr or ExecFinishExpr or something along that line would > be better, but nothing is really standing out as le mot juste. Either of those work, but they don't strike me as perfect either; I can't come up with something better (ExecReadyExprForExec()?). Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2017-03-22 10:41:06 -0400, Tom Lane wrote: >> * execQual.c doesn't seem to have a unifying reason to exist anymore. >> It certainly has little to do with evaluating typical qual expressions; >> what's left in there seems to be mostly concerned with SRFs. I feel >> like it might be a good idea to rename it, but to what? execExprUtils.c >> perhaps? Or maybe we should destroy it altogether, shoving the SRF >> stuff into nodeFunctionscan.c and moving what little remains into >> execUtils.c. > Yea, I was wondering about that too. What would we do with > GetAttributeByName/Num? I was thinking execUtils.c for those. >> * I do not like the function name ExecInstantiateExpr(). Webster's >> defines "instantiate" as "to represent (an abstraction) by a concrete >> instance", which does not seem to me to have a lot to do with what this >> function actually does. >> ... > Either of those work, but they don't strike me as perfect either, but I > can't come up with something better (ExecReadyExprForExec()?). Actually, ExecReadyExpr seems kind of nice for this (using "ready" in its verb sense, "to prepare (someone or something) for an activity or purpose"). regards, tom lane
BTW, I'm fairly concerned by what you did in nodeTidscan.c, ie delaying compile of the TID expressions until TidListCreate. I think that probably fails for cases involving, eg, subplans in the expressions; we need subplans to get linked to the parent node, and this way won't do it till (maybe) too late. Barring objection I'm going to rearrange that so that we still do the compile part at executor start. regards, tom lane
On 2017-03-22 13:15:41 -0400, Tom Lane wrote: > BTW, I'm fairly concerned by what you did in nodeTidscan.c, ie delaying > compile of the TID expressions until TidListCreate. I think that probably > fails for cases involving, eg, subplans in the expressions; we need > subplans to get linked to the parent node, and this way won't do it till > (maybe) too late. Barring objection I'm going to rearrange that so that > we still do the compile part at executor start. No objection here.
Looking at some of the coding choices you made here, I see that you're making a hard assumption that called functions never scribble on their fcinfo->arg/argnull arrays. I do not believe that we've ever had such an assumption before. Are we comfortable with that? If so, we'd better document it. regards, tom lane
Hi, On 2017-03-23 13:14:59 -0400, Tom Lane wrote: > Looking at some of the coding choices you made here, I see that you're > making a hard assumption that called functions never scribble on their > fcinfo->arg/argnull arrays. I do not believe that we've ever had such > an assumption before. I think we did that before, e.g. ExecEvalScalarArrayOp(). Think there's others too. > Are we comfortable with that? If so, we'd better document it. I think it's ok, but we indeed should document it. I recall a note somewhere... Can't find it anywhere however, might have misremembered a note about pass-by-ref arguments. fmgr/README? A note in FunctionCallInfoData's definition? Greetings, Andres Freund
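Wording along these lines (invented here, not committed text) might do for fmgr/README or the comments on FunctionCallInfoData:

    Callers are entitled to assume that a called function does not
    scribble on fcinfo->arg or fcinfo->argnull.  Expression evaluation
    relies on this: argument arrays are reused across calls without
    re-initializing entries whose values have not changed.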
Stylistic thought ... I am wondering if it wouldn't be a good idea to replace EEOP_CASE_WHEN_STEP, EEOP_CASE_THEN_STEP, EEOP_COALESCE, and EEOP_ARRAYREF_CHECKINPUT with instructions defined in a less usage-dependent way as

    EEOP_JUMP                unconditional jump
    EEOP_JUMP_IF_NULL        jump if step result is null
    EEOP_JUMP_IF_NOT_NULL    jump if step result isn't null
    EEOP_JUMP_IF_NOT_TRUE    jump if step result isn't TRUE

One could imagine later filling out this set with the other BoolTest condition types, but that seems to be all we need right now. These are basically just renamings of the step types that exist now, although EEOP_ARRAYREF_CHECKINPUT would have to drop its not-very-necessary Assert(!op->d.arrayref.state->isassignment). Well, I guess I should say that they're renamings of the semantics that I have for these steps in my working copy; for instance, I got rid of casewhen.value/casewhen.isnull in favor of letting CASE WHEN expressions evaluate into the CASE's final output variable. At least to me, the compiling code would be more readable this way. I find WHEN_STEP and THEN_STEP a bit odd because they are emitted after, not before, the expressions you'd think they control. ARRAYREF_CHECKINPUT is pretty vaguely named, too. regards, tom lane
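To illustrate how the generic jump steps would be used, here is a hypothetical sketch of compiling COALESCE with them (the helper and field names are guesses at the patch's conventions, not its actual code):

    List       *adjust_jumps = NIL;
    ListCell   *lc;

    foreach(lc, coalesce->args)
    {
        /* evaluate the argument directly into COALESCE's result slot */
        ExecInitExprRec((Expr *) lfirst(lc), parent, state, resv, resnull);

        /* if it produced a non-NULL value, we're done: jump to the end */
        scratch.opcode = EEOP_JUMP_IF_NOT_NULL;
        scratch.d.jump.jumpdone = -1;   /* target fixed up below */
        adjust_jumps = lappend_int(adjust_jumps, state->steps_len);
        ExprEvalPushStep(state, &scratch);
    }

    /* all jumps land here, just past the last argument's steps */
    foreach(lc, adjust_jumps)
    {
        ExprEvalStep *jump = &state->steps[lfirst_int(lc)];

        jump->d.jump.jumpdone = state->steps_len;
    }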
I found a rather nasty bug :-( ... the comment in EEOP_INNER_VAR_FIRST about

+            /*
+             * Can't assert tts_nvalid, as wholerow var evaluation or such
+             * could have materialized the slot - but the contents are still
+             * valid :/
+             */
+            Assert(op->d.var.attnum >= 0);

is actually averting its eyes from a potentially serious problem. If we go through ExecEvalWholeRowVar, which calls ExecFetchSlotTupleDatum, and ExecFetchSlotTuple decides it has to materialize the tuple, then ExecMaterializeSlot does this:

    /*
     * Drop the pin on the referenced buffer, if there is one.
     */
    if (BufferIsValid(slot->tts_buffer))
        ReleaseBuffer(slot->tts_buffer);

    slot->tts_buffer = InvalidBuffer;

    /*
     * Mark extracted state invalid.  This is important because the slot is
     * not supposed to depend any more on the previous external data; we
     * mustn't leave any dangling pass-by-reference datums in tts_values.
     * However, we have not actually invalidated any such datums, if there
     * happen to be any previously fetched from the slot.  (Note in particular
     * that we have not pfree'd tts_mintuple, if there is one.)
     */
    slot->tts_nvalid = 0;

The problem here is that once we drop the buffer pin, any pointers we may have into on-disk data are dangling pointers --- we're at risk of some other backend taking away that shared buffer. (So I'm afraid that the commentary there is overly optimistic.) So even though those pointers may still be there beyond tts_nvalid, subsequent references to them are very dangerous. I think that it's pretty hard to hit this in practice, maybe impossible, because the normal case for an "on-disk" tuple is that TTS_HAS_PHYSICAL_TUPLE is true, so that ExecFetchSlotTuple won't change the state of the slot. If we have a virtual tuple that has to be materialized, then by that very token it won't have a buffer pin to drop. But I find this fragile as heck, and the aforesaid patch comment surely isn't adequately documenting the safety considerations. Also, if there ever were a live bug here, reproducing it would be damn hard because of the low probability that a just-unpinned buffer would get replaced any time soon. (Hm, I wonder whether the buffer cache needs something analogous to the syscaches' CLOBBER_CACHE_ALWAYS behavior...) Besides which, I really really don't like the lack of an "attnum < tts_nvalid" assertion there; that's just obviously failing to check for very simple bugs, such as getting the FETCHSOME steps wrong. So I think that we have got to fix ExecEvalWholeRowVar so that it doesn't clobber the state of the slot. Right at the moment, the only way to do that seems to be to do this instead of ExecFetchSlotTupleDatum:

    tuple = ExecCopySlotTuple(slot);
    dtuple = (HeapTupleHeader)
        DatumGetPointer(heap_copy_tuple_as_datum(tuple,
                                                 slot->tts_tupleDescriptor));
    heap_freetuple(tuple);

That's kind of annoying because of the double copying involved, but tuptoaster.c doesn't expose any functionality for this except heap_copy_tuple_as_datum(). I figure we can improve it later --- it looks like we can refactor heap_copy_tuple_as_datum to expose a function that reads from Datum/isnull arrays, and then call it on the slot's tts_values/tts_isnull arrays. Seems like it might be a good idea to think a bit harder about the TupleTableSlot APIs, too, and reduce the number of cases where execTuples exposes destructive changes to the state of a slot. Also, while trying to test the above scenario, I realized that the patch as submitted was being way too cavalier about where it was applying CheckVarSlotCompatibility and so on.
The ASSIGN_FOO_VAR steps, for instance, had no protection at all. Think I have that all fixed up though. I hope to have a fully reviewed patch to pass back to you tomorrow. Or Saturday at the latest. regards, tom lane
Hi, On 2017-03-23 17:40:55 -0400, Tom Lane wrote: > Stylistic thought ... I am wondering if it wouldn't be a good idea > to replace EEOP_CASE_WHEN_STEP, EEOP_CASE_THEN_STEP, EEOP_COALESCE, > and EEOP_ARRAYREF_CHECKINPUT with instructions defined in a less > usage-dependent way as
>
>     EEOP_JUMP                unconditional jump
>     EEOP_JUMP_IF_NULL        jump if step result is null
>     EEOP_JUMP_IF_NOT_NULL    jump if step result isn't null
>     EEOP_JUMP_IF_NOT_TRUE    jump if step result isn't TRUE
>
> One could imagine later filling out this set with the other BoolTest > condition types, but that seems to be all we need right now. Hm, no arguments against, but I'm also not particularly excited about the change. > These are basically just renamings of the step types that exist now, > although EEOP_ARRAYREF_CHECKINPUT would have to drop its not-very- > necessary Assert(!op->d.arrayref.state->isassignment). I won't shed a tear about that assert's removal. > Well, I guess I should say that they're renamings of the semantics > that I have for these steps in my working copy; for instance, I got > rid of casewhen.value/casewhen.isnull in favor of letting CASE WHEN > expressions evaluate into the CASE's final output variable. That sounds like a sensible change (in the abstract; I obviously haven't seen your working copy). Greetings, Andres Freund
On 2017-03-23 20:36:32 -0400, Tom Lane wrote: > I found a rather nasty bug :-( ... the comment in EEOP_INNER_VAR_FIRST about
>
> +            /*
> +             * Can't assert tts_nvalid, as wholerow var evaluation or such
> +             * could have materialized the slot - but the contents are still
> +             * valid :/
> +             */
> +            Assert(op->d.var.attnum >= 0);
>
> is actually averting its eyes from a potentially serious problem. If we > go through ExecEvalWholeRowVar, which calls ExecFetchSlotTupleDatum, and > ExecFetchSlotTuple decides it has to materialize the tuple, then > ExecMaterializeSlot does this:
>
>     /*
>      * Drop the pin on the referenced buffer, if there is one.
>      */
>     if (BufferIsValid(slot->tts_buffer))
>         ReleaseBuffer(slot->tts_buffer);
>
>     slot->tts_buffer = InvalidBuffer;
>
>     /*
>      * Mark extracted state invalid.  This is important because the slot is
>      * not supposed to depend any more on the previous external data; we
>      * mustn't leave any dangling pass-by-reference datums in tts_values.
>      * However, we have not actually invalidated any such datums, if there
>      * happen to be any previously fetched from the slot.  (Note in particular
>      * that we have not pfree'd tts_mintuple, if there is one.)
>      */
>     slot->tts_nvalid = 0;
>
> The problem here is that once we drop the buffer pin, any pointers we may > have into on-disk data are dangling pointers --- we're at risk of some > other backend taking away that shared buffer. (So I'm afraid that the > commentary there is overly optimistic.) So even though those pointers > may still be there beyond tts_nvalid, subsequent references to them are > very dangerous. This applies to the code in master as well, no? An ExecEvalScalarVar() followed by an ExecEvalWholeRowVar() would have precisely the same effect? Do we need to do anything about this in the back-branches, given how unlikely this is going to be in practice? > So I think that we have got to fix ExecEvalWholeRowVar so that it doesn't > clobber the state of the slot. That seems like a good plan. > Also, while trying to test the above scenario, I realized that the patch > as submitted was being way too cavalier about where it was applying > CheckVarSlotCompatibility and so on. The ASSIGN_FOO_VAR steps, for > instance, had no protection at all. I don't think that's true - the assign checks had copied the code from the old ExecBuildProjectionInfo, setting isSimpleVar iff (!attr->attisdropped && variable->vartype == attr->atttypid) - we can check that for projections, in contrast to normal expressions, because we already know the slot. The relevant comment for that, from before the patch, is:

 * inputDesc.  (Note: if there is a type mismatch then ExecEvalScalarVar
 * will probably throw an error at runtime, but we leave that to it.)
 */

> I hope to have a fully reviewed patch to pass back to you tomorrow. > Or Saturday at the latest. Cool. Greetings, Andres Freund
On 2017-03-23 20:36:32 -0400, Tom Lane wrote: > So I think that we have got to fix ExecEvalWholeRowVar so that it doesn't > clobber the state of the slot. Right at the moment, the only way to do > that seems to be to do this instead of ExecFetchSlotTupleDatum: > > tuple = ExecCopySlotTuple(slot); > dtuple = (HeapTupleHeader) > DatumGetPointer(heap_copy_tuple_as_datum(tuple, > slot->tts_tupleDescriptor)); > heap_freetuple(tuple); Hm. One disadvantage would be that repeated whole-row references to the same table would be a bit slower, because we'd repeatedly form a tuple from a virtual one - but I have a hard time coming up with a scenario where that'd matter. I'd suspect that in the end it'd probably even have a *positive* performance impact, because right now the next scalar access will have to deform the whole tuple again, and that seems like a lot more likely scenario. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2017-03-23 20:36:32 -0400, Tom Lane wrote: >> The problem here is that once we drop the buffer pin, any pointers we may >> have into on-disk data are dangling pointers --- we're at risk of some >> other backend taking away that shared buffer. (So I'm afraid that the >> commentary there is overly optimistic.) So even though those pointers >> may still be there beyond tts_nvalid, subsequent references to them are >> very dangerous. > This applies to the code in master as well, no? An ExecEvalScalarVar() > followed by an ExecEvalWholeRowVar() would have precisely the same > effect? Yeah. The other order would be safe, because ExecEvalScalarVar would do slot_getattr which would re-extract the value from the newly materialized tuple. But there definitely seems to be a hazard for the order you mentioned. > Do we need to do anything about this in the back-branches, > given how unlikely this is going to be in practice? Probably not. As I mentioned, I think this may be only theoretical rather than real, if you believe that buffer pins would only be associated with slots holding references to regular tuples. And even if it's not theoretical, the odds of seeing a failure in the field seem pretty tiny given that a just-released buffer shouldn't be subject to recycling for a fair while. But I don't want to leave it like this going forward. >> So I think that we have got to fix ExecEvalWholeRowVar so that it doesn't >> clobber the state of the slot. > That seems like a good plan. Yeah. I have the stopgap code in my working copy, and will look at refactoring the tuptoaster code for better performance later. >> Also, while trying to test the above scenario, I realized that the patch >> as submitted was being way too cavalier about where it was applying >> CheckVarSlotCompatibility and so on. The ASSIGN_FOO_VAR steps, for >> instance, had no protection at all. > I don't think that's true - the assign checks had copied the code from > the old ExecBuildProjectionInfo, setting isSimpleVar iff > (!attr->attisdropped && variable->vartype == attr->atttypid) - we can > check that for projections in contrast to normal expressions because we > already know the slot. Hmm, I see ... but that only works in the cases where the caller of ExecBuildProjectionInfo supplied a source slot, and a lot of 'em don't. As the code stands, we are unable to use ASSIGN_FOO_VAR in quite a lot of places, including everywhere above the relation scan level. I'd already put in the infrastructure to add ASSIGN_FOO_VAR_FIRST step types. I could take it back out, but I wonder if it wouldn't be smarter to keep it and remove the restriction in ExecBuildProjectionInfo. Or maybe we could have ExecBuildProjectionInfo emit either ASSIGN_FOO_VAR_FIRST or ASSIGN_FOO_VAR depending on whether it can prove the reference safe. regards, tom lane
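That last idea could look roughly like this inside ExecBuildProjectionInfo (a hypothetical sketch; the opcode names follow the pattern under discussion, and the attribute checks mirror the usual isSimpleVar tests):

    /*
     * Hypothetical sketch: emit the non-checking assign step only when
     * inputDesc proves the Var reference safe at compile time; otherwise
     * emit the _FIRST variant, which validates once at runtime.
     */
    if (inputDesc != NULL &&
        attnum > 0 && attnum <= inputDesc->natts &&
        !inputDesc->attrs[attnum - 1]->attisdropped &&
        variable->vartype == inputDesc->attrs[attnum - 1]->atttypid)
        scratch.opcode = EEOP_ASSIGN_INNER_VAR;         /* proven safe */
    else
        scratch.opcode = EEOP_ASSIGN_INNER_VAR_FIRST;   /* check at runtime */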
On 2017-03-23 21:26:19 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-23 20:36:32 -0400, Tom Lane wrote:
> >> The problem here is that once we drop the buffer pin, any pointers we may
> >> have into on-disk data are dangling pointers --- we're at risk of some
> >> other backend taking away that shared buffer. (So I'm afraid that the
> >> commentary there is overly optimistic.) So even though those pointers
> >> may still be there beyond tts_nvalid, subsequent references to them are
> >> very dangerous.
>
> > This applies to the code in master as well, no? An ExecEvalScalarVar()
> > followed by an ExecEvalWholeRowVar() would have precisely the same
> > effect?
>
> Yeah. The other order would be safe, because ExecEvalScalarVar would do
> slot_getattr which would re-extract the value from the newly materialized
> tuple. But there definitely seems to be a hazard for the order you
> mentioned.
>
> > Do we need to do anything about this in the back-branches,
> > given how unlikely this is going to be in practice?
>
> Probably not. As I mentioned, I think this may be only theoretical rather
> than real, if you believe that buffer pins would only be associated with
> slots holding references to regular tuples. And even if it's not
> theoretical, the odds of seeing a failure in the field seem pretty tiny
> given that a just-released buffer shouldn't be subject to recycling for
> a fair while. But I don't want to leave it like this going forward.

Ok.

> >> Also, while trying to test the above scenario, I realized that the patch
> >> as submitted was being way too cavalier about where it was applying
> >> CheckVarSlotCompatibility and so on. The ASSIGN_FOO_VAR steps, for
> >> instance, had no protection at all.
>
> > I don't think that's true - the assign checks had copied the code from
> > the old ExecBuildProjectionInfo, setting isSimpleVar iff
> > (!attr->attisdropped && variable->vartype == attr->atttypid) - we can
> > check that for projections in contrast to normal expressions because we
> > already know the slot.
>
> Hmm, I see ... but that only works in the cases where the caller of
> ExecBuildProjectionInfo supplied a source slot, and a lot of 'em
> don't.

Right, the old and new code comment on that:

 * inputDesc can be NULL, but if it is not, we check to see whether simple
 * Vars in the tlist match the descriptor. It is important to provide
 * inputDesc for relation-scan plan nodes, as a cross check that the relation
 * hasn't been changed since the plan was made. At higher levels of a plan,
 * there is no need to recheck.

and that seems reasonable to me? That said, I think we can remove that
assumption, by checking once.

> As the code stands, we are unable to use ASSIGN_FOO_VAR in quite a lot
> of places, including everywhere above the relation scan level.

Hm? If inputDesc isn't given we just, before and after, do:

    if (!inputDesc)
        isSimpleVar = true;     /* can't check type, assume OK */

> I'd already put in the infrastructure to add ASSIGN_FOO_VAR_FIRST
> step types. I could take it back out, but I wonder if it wouldn't be
> smarter to keep it and remove the restriction in ExecBuildProjectionInfo.
> Or maybe we could have ExecBuildProjectionInfo emit either
> ASSIGN_FOO_VAR_FIRST or ASSIGN_FOO_VAR depending on whether it can prove
> the reference safe.

I think it's probably ok to just leave the check in, and remove those
comments, and simplify the isSimpleVar stuff to only check if
IsA(tle->expr, Var) && ((Var *) tle->expr)->varattno > 0)

- Andres
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-23 21:26:19 -0400, Tom Lane wrote:
>> Hmm, I see ... but that only works in the cases where the caller of
>> ExecBuildProjectionInfo supplied a source slot, and a lot of 'em
>> don't.

> Right, the old and new code comment on that:

> * inputDesc can be NULL, but if it is not, we check to see whether simple
> * Vars in the tlist match the descriptor. It is important to provide
> * inputDesc for relation-scan plan nodes, as a cross check that the relation
> * hasn't been changed since the plan was made. At higher levels of a plan,
> * there is no need to recheck.

Ah, I'd forgotten the assumption that we only need to check this at scan
level.

>> I'd already put in the infrastructure to add ASSIGN_FOO_VAR_FIRST
>> step types. I could take it back out, but I wonder if it wouldn't be
>> smarter to keep it and remove the restriction in ExecBuildProjectionInfo.
>> Or maybe we could have ExecBuildProjectionInfo emit either
>> ASSIGN_FOO_VAR_FIRST or ASSIGN_FOO_VAR depending on whether it can prove
>> the reference safe.

> I think it's probably ok to just leave the check in, and remove those
> comments, and simplify the isSimpleVar stuff to only check if
> IsA(tle->expr, Var) && ((Var *) tle->expr)->varattno > 0)

Not sure. It's a pretty fair amount of duplicative code, once you finish
dealing with all the ExecJustFoo functions in addition to the main code
paths. At this point I'm inclined to take it back out and improve the
comments around ExecBuildProjectionInfo.

regards, tom lane
On 2017-03-23 21:58:03 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-23 21:26:19 -0400, Tom Lane wrote:
> >> Hmm, I see ... but that only works in the cases where the caller of
> >> ExecBuildProjectionInfo supplied a source slot, and a lot of 'em
> >> don't.
>
> > Right, the old and new code comment on that:
>
> > * inputDesc can be NULL, but if it is not, we check to see whether simple
> > * Vars in the tlist match the descriptor. It is important to provide
> > * inputDesc for relation-scan plan nodes, as a cross check that the relation
> > * hasn't been changed since the plan was made. At higher levels of a plan,
> > * there is no need to recheck.
>
> Ah, I'd forgotten the assumption that we only need to check this at scan
> level.
>
> >> I'd already put in the infrastructure to add ASSIGN_FOO_VAR_FIRST
> >> step types. I could take it back out, but I wonder if it wouldn't be
> >> smarter to keep it and remove the restriction in ExecBuildProjectionInfo.
> >> Or maybe we could have ExecBuildProjectionInfo emit either
> >> ASSIGN_FOO_VAR_FIRST or ASSIGN_FOO_VAR depending on whether it can prove
> >> the reference safe.
>
> > I think it's probably ok to just leave the check in, and remove those
> > comments, and simplify the isSimpleVar stuff to only check if
> > IsA(tle->expr, Var) && ((Var *) tle->expr)->varattno > 0)
>
> Not sure. It's a pretty fair amount of duplicative code, once you finish
> dealing with all the ExecJustFoo functions in addition to the main code
> paths. At this point I'm inclined to take it back out and improve the
> comments around ExecBuildProjectionInfo.

I'm ok with both.

- Andres
Another modest proposal:

I'm not really sold on the approach of using EEOP_FETCHSOME opcodes to
trigger initial tupleslot de-forming. Certainly we want to have a single
slot_getsomeattrs call per source slot, but as-is, we need a separate
traversal over the expression tree just to precompute the max attribute
number needed. That can't be good for expression compile speed, and it
introduces nontrivial bug risks because the code that does that is
completely decoupled from the code that emits the EEOP_VAR opcodes
(which are what's really relying on the de-forming to have happened).

What do you think about a design like this:

* Drop the FETCHSOME opcodes.

* Add fields to struct ExprState that will hold the maximum inner,
outer, and scan attribute numbers needed.

* ExecInitExpr initializes those fields to zero, and then during
ExecInitExprRec, whenever we generate an EEOP_VAR opcode, we do e.g.

    state->last_inner_attno = Max(state->last_inner_attno,
                                  variable->varattno);

* ExecInitExprSlots, get_last_attnums_walker, etc all go away.

* In the startup segment of ExecInterpExpr, add

    if (state->last_inner_attno > 0)
        slot_getsomeattrs(innerslot, state->last_inner_attno);
    if (state->last_outer_attno > 0)
        slot_getsomeattrs(outerslot, state->last_outer_attno);
    if (state->last_scan_attno > 0)
        slot_getsomeattrs(scanslot, state->last_scan_attno);

This would be a little different from the existing code as far as runtime
branch-prediction behavior goes, but it's not apparent to me that it'd be
any worse. Also, for JIT purposes it'd still be entirely possible to
compile the slot_getsomeattrs calls in-line; you'd just be looking to a
different part of the ExprState struct to find out what to do.

regards, tom lane
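To visualize the proposal, here is a minimal standalone sketch. The field
names follow the mail above, but ExprStateSketch and note_inner_var are
stand-ins invented for illustration; this is not PostgreSQL code:

    #include <stdint.h>

    typedef int16_t AttrNumber;

    /* Stand-in for the proposed ExprState additions (illustration only) */
    typedef struct ExprStateSketch
    {
        AttrNumber  last_inner_attno;   /* 0 means inner slot never referenced */
        AttrNumber  last_outer_attno;
        AttrNumber  last_scan_attno;
    } ExprStateSketch;

    /*
     * What ExecInitExprRec would do whenever it emits an EEOP_VAR-style
     * step against the inner slot: remember the highest attribute number,
     * so one slot_getsomeattrs() call at interpreter startup suffices.
     */
    static void
    note_inner_var(ExprStateSketch *state, AttrNumber varattno)
    {
        if (varattno > state->last_inner_attno)
            state->last_inner_attno = varattno;
    }

The design choice being debated is exactly this: the max-attno bookkeeping
moves from a separate post-hoc tree walk into the same code path that emits
the Var opcodes, so the two can't drift apart.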
Hi,

On 2017-03-24 11:26:27 -0400, Tom Lane wrote:
> Another modest proposal:
>
> I'm not really sold on the approach of using EEOP_FETCHSOME opcodes to
> trigger initial tupleslot de-forming. Certainly we want to have a single
> slot_getsomeattrs call per source slot, but as-is, we need a separate
> traversal over the expression tree just to precompute the max attribute
> number needed. That can't be good for expression compile speed, and
> it introduces nontrivial bug risks because the code that does that
> is completely decoupled from the code that emits the EEOP_VAR opcodes
> (which are what's really relying on the de-forming to have happened).

Hm. We had the separate traversal for projections for a long while, and
I don't think there've been a lot of changes to the extraction of the
last attribute number. I'm very doubtful that the cost of traversing the
expression twice is meaningful in comparison to the other costs.

> What do you think about a design like this:
>
> * Drop the FETCHSOME opcodes.
>
> * Add fields to struct ExprState that will hold the maximum inner,
> outer, and scan attribute numbers needed.
>
> * ExecInitExpr initializes those fields to zero, and then during
> ExecInitExprRec, whenever we generate an EEOP_VAR opcode, we do e.g.
>
>     state->last_inner_attno = Max(state->last_inner_attno,
>                                   variable->varattno);
>
> * ExecInitExprSlots, get_last_attnums_walker, etc all go away.
>
> * In the startup segment of ExecInterpExpr, add
>
>     if (state->last_inner_attno > 0)
>         slot_getsomeattrs(innerslot, state->last_inner_attno);
>     if (state->last_outer_attno > 0)
>         slot_getsomeattrs(outerslot, state->last_outer_attno);
>     if (state->last_scan_attno > 0)
>         slot_getsomeattrs(scanslot, state->last_scan_attno);
>
> This would be a little different from the existing code as far as runtime
> branch-prediction behavior goes, but it's not apparent to me that it'd be
> any worse.

I'd be surprised if it weren't. I'm not super strongly against this
setup, but I fail to see the benefit of whacking this around. I've
benchmarked the previous/current setup fairly extensively, and I'd
rather not redo that. In contrast to the other changes you've talked
about, this definitely is in the hot-path...

> Also, for JIT purposes it'd still be entirely possible to compile the
> slot_getsomeattrs calls in-line; you'd just be looking to a different
> part of the ExprState struct to find out what to do.

Yea, we could do that.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> Hm. We had the separate traversal for projections for a long while, and
> I don't think there've been a lot of changes to the extraction of the
> last attribute number.

That's easily disproven just by looking at the code:

    /*
     * Don't examine the arguments or filters of Aggrefs or WindowFuncs,
     * because those do not represent expressions to be evaluated within the
     * calling expression's econtext. GroupingFunc arguments are never
     * evaluated at all.
     */
    if (IsA(node, Aggref))
        return false;
    if (IsA(node, WindowFunc))
        return false;
    if (IsA(node, GroupingFunc))
        return false;
    return expression_tree_walker(node, get_last_attnums_walker,
                                  (void *) info);

The WindowFunc exception hasn't been there so long, and the GroupingFunc
one is very new. And who's to say whether e.g. the recent XMLTABLE patch
got this right at all? We could easily be extracting columns we don't
need to.

I'm willing to leave this as-is for the moment, but I really think we
should look into changing it (after the main patch is in).

regards, tom lane
Attached is an updated patch. I believe this is committable, but since
I whacked it around quite a bit, I'm sure you'll want to look it over
first.

Please be sure the commit message notes that function EXECUTE permissions
are now checked at executor startup not first call. We need to document
that in the v10 release notes as an incompatibility, and I'm sure we'll
forget if the commit log doesn't point it out.

Some loose ends that remain to be looked at, though I do not think any
of these are reasons to postpone commit:

* I'm concerned that we now do not have enough check_stack_depth() calls.
In our off-list discussion I wanted to add one to ExecProcNode, and you
were unhappy about that ... but it'd still be a lot less stack checking
than we do now.

* I still think we should look into removing the EEOP_FETCHSOME op types,
or at least finding some other way to perform the calculation of the last
attnums in the mainline expression compilation path.

* As we discussed, ExecEvalWholeRowVar is now using a pretty inefficient
method for flattening tuples into datums. I will take a to-do item to
fix this.

* ExecInitCheck is really just ExecInitExpr with a make_ands_explicit
call in front of it. It turned out that most of the places that were
(or should have been) calling it were doing a make_ands_implicit call
for no other reason than to satisfy its API. I changed most of those
to just call ExecInitExpr directly. There are just a couple of call
sites left, and I think probably those are likewise not really that
happy with this API --- but I didn't take the time to chase down where
the expressions were coming from in those cases. It seems possible
that we could remove ExecInitCheck/ExecPrepareCheck entirely. I'll
look into this later, too.

* As we discussed, CaseTestValue/DomainValue are pretty messy and need
to be rethought. That might not get done for v10 though.

* I'm not very happy that execSRF.c has two somewhat different
implementations of largely similar functionality, and
SetExprState.elidedFuncState seems like a wart. That's mostly a
pre-existing problem of course. I'm satisfied with leaving it as it is
for now, but eventually some refactoring there would be good.

The attached patch is against HEAD as of last night (commit 457a44487).

regards, tom lane
btw ... I just got around to looking at a code coverage report for this
patched version, and that reminded me of something I'd already suspected:
EEOP_INNER_SYSVAR and EEOP_OUTER_SYSVAR seem to be dead code. That's
unsurprising, because we never try to access a tuple's system columns
above the scan level. If a query asks for system columns, those get
passed up to upper query levels as ordinary user-side columns.

We could keep the execution support for those opcodes, or we could rip
it out and throw an error in execExpr.c if one would need to be
generated. I'm a little inclined to the latter, because it seems like
the plan is to grow more support for every opcode in the future. We
don't need to be adding support for unreachable opcodes.

regards, tom lane
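If the rip-it-out option were chosen, expression compilation could simply
refuse to emit those steps. A hypothetical sketch of that check - the
function name and error path are invented for illustration, though the
INNER_VAR/OUTER_VAR marker values match the PostgreSQL headers:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef int16_t AttrNumber;

    /* Special varno markers, as defined in the PostgreSQL headers */
    #define INNER_VAR 65000
    #define OUTER_VAR 65001

    /*
     * System columns have negative attribute numbers and are only ever
     * fetched at scan level, so a system-column Var against the inner or
     * outer slot should never reach expression compilation.
     */
    static void
    reject_unreachable_sysvar(int varno, AttrNumber varattno)
    {
        if (varattno < 0 && (varno == INNER_VAR || varno == OUTER_VAR))
        {
            fprintf(stderr, "unexpected system column reference above scan level\n");
            abort();
        }
    }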
More random musing ... have you considered making the jump-target fields
in expressions be relative rather than absolute indexes? That is,
EEO_JUMP would look like

    op += (stepno); \
    EEO_DISPATCH(); \

instead of

    op = &state->steps[stepno]; \
    EEO_DISPATCH(); \

I have not carried out a full patch to make this work, but just making
that one change and examining the generated assembly code looks promising.
Instead of this

    movslq 40(%r14), %r8
    salq   $6, %r8
    addq   24(%rbx), %r8
    movq   %r8, %r14
    jmp    *(%r8)

we get this

    movslq 40(%r14), %rax
    salq   $6, %rax
    addq   %rax, %r14
    jmp    *(%r14)

which certainly looks like it ought to be faster. Also, the real reason
I got interested in this at all is that with relative jumps, groups of
steps would be position-independent within the steps array, which would
enable some compile-time tricks that seem impractical with the current
definition.

BTW, now that I've spent a bit of time looking at the generated assembly
code, I'm kind of disinclined to believe any arguments about how we have
better control over branch prediction with the jump-threading
implementation. At least with current gcc (6.3.1 on Fedora 25) at -O2,
what I see is multiple places jumping to the same indirect jump
instruction :-(. It's not a total disaster: as best I can tell, all the
uses of EEO_JUMP remain distinct. But gcc has chosen to implement about
40 of the 71 uses of EEO_NEXT by jumping to the same couple of
instructions that increment the "op" register and then do an indirect
jump :-(. So it seems that we're at the mercy of gcc's whims as to which
instruction dispatches will be distinguishable to the hardware; which
casts a very dark shadow over any benchmarking-based arguments that X is
better than Y for branch prediction purposes. Compiler version
differences are likely to matter a lot more than anything we do.

regards, tom lane
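To make the two dispatch styles concrete, here is a small self-contained
toy interpreter. It is GCC/Clang-specific (it relies on computed goto),
it is not PostgreSQL code, and the EEO_NEXT macro merely mimics the real
one; it can run with either absolute or relative jump targets:

    #include <stdio.h>

    typedef struct Step
    {
        int  opcode;    /* index into the dispatch table below */
        int  target;    /* jump target: absolute index or relative offset */
        long operand;
    } Step;

    /*
     * Tiny opcode-dispatch interpreter: sums operands, with one jump step.
     * With relative jumps the dispatch needs no reference back to the base
     * of the steps array, which is the saving visible in the assembly above.
     */
    static long
    run(Step *steps, int use_relative)
    {
        static void *dispatch[] = { &&op_add, &&op_jump, &&op_done };
        long  acc = 0;
        Step *op = steps;

    #define EEO_NEXT() goto *dispatch[(++op)->opcode]
        goto *dispatch[op->opcode];

    op_add:
        acc += op->operand;
        EEO_NEXT();
    op_jump:
        if (use_relative)
            op += op->target;           /* op += (stepno); */
        else
            op = &steps[op->target];    /* op = &state->steps[stepno]; */
        goto *dispatch[op->opcode];
    op_done:
        return acc;
    #undef EEO_NEXT
    }

    int
    main(void)
    {
        /* add 1; jump over "add 100"; add 2; done => 3 */
        Step abs_prog[] = { {0, 0, 1}, {1, 3, 0}, {0, 0, 100}, {0, 0, 2}, {2, 0, 0} };
        Step rel_prog[] = { {0, 0, 1}, {1, 2, 0}, {0, 0, 100}, {0, 0, 2}, {2, 0, 0} };

        printf("%ld %ld\n", run(abs_prog, 0), run(rel_prog, 1));
        return 0;
    }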
Hi,

On 2017-03-24 17:16:15 -0400, Tom Lane wrote:
> Attached is an updated patch. I believe this is committable, but
> since I whacked it around quite a bit, I'm sure you'll want to look
> it over first.

Indeed. Points:

- It's going to take a while for me to get used to execExprInterp.c -
  I constantly try to switch to a nonexistent buffer ;)

- Like your approach with /* Check that current tupdesc doesn't have
  more fields than we allocated */ in ExecEvalFieldStoreDeForm().

- I like EEO_FLAG_IS_QUAL.

- I think at some point (not sure whether in your or in my revision)
  ScalarArrayOpExpr lost its permission check. Added that.

- Added note that we knowingly don't perform permission checks for
  input/output funcs.

- Both pre/post patch, CoerceViaIO doesn't invoke
  InvokeFunctionExecuteHook - I'm still unclear what that hook is
  useful for, so ...

- The !caseExpr->defresult result branch is currently unreachable (and
  its equivalent was before the patch) because transformCaseExpr()
  generates a default expression. I'm inclined to replace the dead code
  with an assertion. Any reason not to do that?

- I see you kept determination of the set of constraints to be checked
  for domains at initialization time. Made a note that that a) might
  change and b) could change behaviour.

> Please be sure the commit message notes that function EXECUTE permissions
> are now checked at executor startup not first call. We need to document
> that in the v10 release notes as an incompatibility, and I'm sure we'll
> forget if the commit log doesn't point it out.

Done.

> * As we discussed, ExecEvalWholeRowVar is now using a pretty inefficient
> method for flattening tuples into datums. I will take a to-do item to
> fix this.

Thanks.

> * ExecInitCheck is really just ExecInitExpr with a make_ands_explicit
> call in front of it. It turned out that most of the places that were
> (or should have been) calling it were doing a make_ands_implicit call
> for no other reason than to satisfy its API. I changed most of those
> to just call ExecInitExpr directly. There are just a couple of call
> sites left, and I think probably those are likewise not really that
> happy with this API --- but I didn't take the time to chase down where
> the expressions were coming from in those cases. It seems possible
> that we could remove ExecInitCheck/ExecPrepareCheck entirely. I'll
> look into this later, too.

Thanks^2.

> * As we discussed, CaseTestValue/DomainValue are pretty messy and need
> to be rethought. That might not get done for v10 though.

Yea, I'm disinclined to tackle this just now, there's enough other stuff
pending - but I'm willing to tackle it for v11 unless you beat me to it.

> * I'm not very happy that execSRF.c has two somewhat different
> implementations of largely similar functionality, and
> SetExprState.elidedFuncState seems like a wart. That's mostly a
> pre-existing problem of course. I'm satisfied with leaving it as it is
> for now, but eventually some refactoring there would be good.

Yea, that's not pretty. Given the historical difference in behaviour
between SRF and ROWS FROM (the latter always materializes, the former
only when the set function does so itself), I'm not sure they can easily
be simplified. I'm not really sure how to get rid of the issue underlying
elidedFuncState? I mean we could move that case into nodeFunctionscan,
above ExecMakeTableFunctionResult's level, but that doesn't really seem
like an improvement.
I think there's some argument to be made that we should move all of
execSRF.c's logic to the respective callsites. There's not really much
shared code: ExecEvalFuncArgs(), tupledesc_match(), init_sexpr() - and
at least the latter would even become a bit simpler if we didn't support
both callers.

Thanks a lot for your work on the patch!

And pushed.

- Andres
Andres Freund <andres@anarazel.de> writes:
> - The !caseExpr->defresult result branch is currently unreachable (and
> its equivalent was before the patch) because transformCaseExpr()
> generates a default expression. I'm inclined to replace the dead code
> with an assertion. Any reason not to do that?

Ah-hah. I'd noted that that code wasn't being reached when I did some
code coverage checks this morning, but I hadn't found the cause yet.
Yeah, replacing the if-test with an assert seems fine.

> I think there's some argument to be made that we should move all of
> execSRF.c's logic to the respective callsites. There's not really
> much shared code: ExecEvalFuncArgs(), tupledesc_match(), init_sexpr() -
> and at least the latter would even become a bit simpler if we didn't
> support both callers.

Possibly. All that code could stand to be rethought, probably, now that
its mission is just to handle the SRF case. But at least all the cruft
is in one place for the moment. And I don't think it's anything we have
to fix before v10, it's just cosmetic.

> Thanks a lot for your work on the patch!

You're welcome!

regards, tom lane
On 2017-03-25 12:22:15 -0400, Tom Lane wrote:
> More random musing ... have you considered making the jump-target fields
> in expressions be relative rather than absolute indexes? That is,
> EEO_JUMP would look like
>
>     op += (stepno); \
>     EEO_DISPATCH(); \
>
> instead of
>
>     op = &state->steps[stepno]; \
>     EEO_DISPATCH(); \
>
> I have not carried out a full patch to make this work, but just making
> that one change and examining the generated assembly code looks promising.
> Instead of this
>
>     movslq 40(%r14), %r8
>     salq   $6, %r8
>     addq   24(%rbx), %r8
>     movq   %r8, %r14
>     jmp    *(%r8)
>
> we get this
>
>     movslq 40(%r14), %rax
>     salq   $6, %rax
>     addq   %rax, %r14
>     jmp    *(%r14)

That seems like a good idea. I've not done this in the committed version
(and I don't think we necessarily need to do this before the release),
but for the future it seems like a good plan. It makes sense that it's
faster - there's no need to reference state->steps.

> which certainly looks like it ought to be faster. Also, the real reason
> I got interested in this at all is that with relative jumps, groups of
> steps would be position-independent within the steps array, which would
> enable some compile-time tricks that seem impractical with the current
> definition.

Indeed.

> BTW, now that I've spent a bit of time looking at the generated assembly
> code, I'm kind of disinclined to believe any arguments about how we have
> better control over branch prediction with the jump-threading
> implementation.

I measured the performance difference between using it and not using it,
and it came out a pretty clear plus. On gcc 6.3, gcc master snapshot, and
clang-3.9. It's not just that more jumps are duplicated, it's also that
the switch() always adds a boundary check.

> At least with current gcc (6.3.1 on Fedora 25) at -O2,
> what I see is multiple places jumping to the same indirect jump
> instruction :-(. It's not a total disaster: as best I can tell, all the
> uses of EEO_JUMP remain distinct. But gcc has chosen to implement about
> 40 of the 71 uses of EEO_NEXT by jumping to the same couple of
> instructions that increment the "op" register and then do an indirect
> jump :-(.

Yea, I see some of that too - "usually" when there's more than just the
jump in common. I think there's some gcc variables that influence this
(min-crossjump-insns (5), max-goto-duplication-insns (8)). Might be
worthwhile experimenting with setting them locally via a pragma or such.
I think Ants wanted to experiment with that, too.

Then there's also
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785
which causes some forms of computed goto (not ours I think) to be
deoptimized in gcc.

Greetings,

Andres Freund
On Sun, Mar 26, 2017 at 12:22 AM, Andres Freund <andres@anarazel.de> wrote:
>> At least with current gcc (6.3.1 on Fedora 25) at -O2,
>> what I see is multiple places jumping to the same indirect jump
>> instruction :-(. It's not a total disaster: as best I can tell, all the
>> uses of EEO_JUMP remain distinct. But gcc has chosen to implement about
>> 40 of the 71 uses of EEO_NEXT by jumping to the same couple of
>> instructions that increment the "op" register and then do an indirect
>> jump :-(.
>
> Yea, I see some of that too - "usually" when there's more than just the
> jump in common. I think there's some gcc variables that influence this
> (min-crossjump-insns (5), max-goto-duplication-insns (8)). Might be
> worthwhile experimenting with setting them locally via a pragma or such.
> I think Ants wanted to experiment with that, too.

I haven't had the time to research this properly, but initial tests
show that with GCC 6.2 adding

    #pragma GCC optimize ("no-crossjumping")

fixes merging of the op tail jumps.

Some quick and dirty benchmarking suggests that the benefit for the
interpreter is about 15% (5% speedup on a workload that spends 1/3 in
ExecInterpExpr). My idea of prefetching op->resnull/resvalue to local
vars before the indirect jump is somewhere between a tiny benefit and
no effect, certainly not worth introducing extra complexity. Clang 3.8
does the correct thing out of the box and is a couple of percent faster
than GCC with the pragma.

Regards,

Ants Aasma
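For anyone wanting to reproduce this experiment, a sketch of the
file-scoped form (GCC-specific; the toy interpreter shown earlier in the
thread works as a test subject):

    /*
     * Keep GCC's optimizer from merging ("cross-jumping") the duplicated
     * indirect-jump tails of a computed-goto dispatch loop. At file scope
     * this affects every function that follows in the translation unit.
     */
    #pragma GCC optimize ("no-crossjumping")

Comparing the disassembly (objdump -d) of a -O2 build with and without
the pragma shows whether the per-opcode dispatch tails stay distinct.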
On March 25, 2017 4:56:11 PM PDT, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Sun, Mar 26, 2017 at 12:22 AM, Andres Freund <andres@anarazel.de> wrote:
>>> At least with current gcc (6.3.1 on Fedora 25) at -O2,
>>> what I see is multiple places jumping to the same indirect jump
>>> instruction :-(. It's not a total disaster: as best I can tell, all the
>>> uses of EEO_JUMP remain distinct. But gcc has chosen to implement about
>>> 40 of the 71 uses of EEO_NEXT by jumping to the same couple of
>>> instructions that increment the "op" register and then do an indirect
>>> jump :-(.
>>
>> Yea, I see some of that too - "usually" when there's more than just the
>> jump in common. I think there's some gcc variables that influence this
>> (min-crossjump-insns (5), max-goto-duplication-insns (8)). Might be
>> worthwhile experimenting with setting them locally via a pragma or such.
>> I think Ants wanted to experiment with that, too.
>
> I haven't had the time to research this properly, but initial tests
> show that with GCC 6.2 adding
>
>     #pragma GCC optimize ("no-crossjumping")
>
> fixes merging of the op tail jumps.
>
> Some quick and dirty benchmarking suggests that the benefit for the
> interpreter is about 15% (5% speedup on a workload that spends 1/3 in
> ExecInterpExpr). My idea of prefetching op->resnull/resvalue to local
> vars before the indirect jump is somewhere between a tiny benefit and
> no effect, certainly not worth introducing extra complexity. Clang 3.8
> does the correct thing out of the box and is a couple of percent
> faster than GCC with the pragma.

That's large enough to be worth doing (although I recall you seeing all
jumps commonalized). We should probably do this on a per-function basis
however (either using pragma push option, or function attributes).

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes:
> On March 25, 2017 4:56:11 PM PDT, Ants Aasma <ants.aasma@eesti.ee> wrote:
>> I haven't had the time to research this properly, but initial tests
>> show that with GCC 6.2 adding
>>
>>     #pragma GCC optimize ("no-crossjumping")
>>
>> fixes merging of the op tail jumps.
>>
>> Some quick and dirty benchmarking suggests that the benefit for the
>> interpreter is about 15% (5% speedup on a workload that spends 1/3 in
>> ExecInterpExpr). My idea of prefetching op->resnull/resvalue to local
>> vars before the indirect jump is somewhere between a tiny benefit and
>> no effect, certainly not worth introducing extra complexity. Clang 3.8
>> does the correct thing out of the box and is a couple of percent
>> faster than GCC with the pragma.

> That's large enough to be worth doing (although I recall you seeing all
> jumps commonalized). We should probably do this on a per-function basis
> however (either using pragma push option, or function attributes).

Seems like it would be fine to do it on a per-file basis. If you're
worried about pessimizing the out-of-line subroutines, we could move
those to a different file --- it's pretty questionable that they're
in execExprInterp.c in the first place, considering they're meant to be
used by more than just that execution method.

regards, tom lane
On 2017-03-25 23:51:45 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On March 25, 2017 4:56:11 PM PDT, Ants Aasma <ants.aasma@eesti.ee> wrote:
> >> I haven't had the time to research this properly, but initial tests
> >> show that with GCC 6.2 adding
> >>
> >>     #pragma GCC optimize ("no-crossjumping")
> >>
> >> fixes merging of the op tail jumps.
> >>
> >> Some quick and dirty benchmarking suggests that the benefit for the
> >> interpreter is about 15% (5% speedup on a workload that spends 1/3 in
> >> ExecInterpExpr). My idea of prefetching op->resnull/resvalue to local
> >> vars before the indirect jump is somewhere between a tiny benefit and
> >> no effect, certainly not worth introducing extra complexity. Clang 3.8
> >> does the correct thing out of the box and is a couple of percent
> >> faster than GCC with the pragma.
>
> > That's large enough to be worth doing (although I recall you seeing all
> > jumps commonalized). We should probably do this on a per-function basis
> > however (either using pragma push option, or function attributes).
>
> Seems like it would be fine to do it on a per-file basis.

I personally find per-function annotation ala
__attribute__((optimize("no-crossjumping")))
cleaner anyway. I tested that, and it seems to work.

Obviously we'd have to hide that behind a configure test. Could also do
tests based on __GNUC__ / __GNUC_MINOR__, but that seems uglier.

> If you're
> worried about pessimizing the out-of-line subroutines, we could move
> those to a different file --- it's pretty questionable that they're
> in execExprInterp.c in the first place, considering they're meant to be
> used by more than just that execution method.

I indeed am, but having the code in the same file has a minor advantage:
It allows the compiler to partially inline them, if it feels like it
(e.g. moving null checks inline).

Greetings,

Andres Freund
On 2017-03-25 20:59:27 -0700, Andres Freund wrote:
> On 2017-03-25 23:51:45 -0400, Tom Lane wrote:
> > Seems like it would be fine to do it on a per-file basis.
>
> I personally find per-function annotation ala
> __attribute__((optimize("no-crossjumping")))
> cleaner anyway. I tested that, and it seems to work.
>
> Obviously we'd have to hide that behind a configure test. Could also do
> tests based on __GNUC__ / __GNUC_MINOR__, but that seems uglier.

Checking for this isn't entirely pretty - see my attached attempt at
doing so. I considered hiding
__attribute__((optimize("no-crossjumping"))) in execExprInterp.c behind
a macro (like PG_DISABLE_CROSSJUMPING), but I don't really think that
makes things better.

Comments?

Greetings,

Andres Freund
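For illustration, the macro idea mentioned here might look as follows -
a hypothetical sketch assuming configure defines
HAVE_FUNCATTR_NO_CROSSJUMPING after a successful compile test; sum_steps
is a placeholder body, and this is not what was ultimately committed:

    #ifdef HAVE_FUNCATTR_NO_CROSSJUMPING
    #define PG_DISABLE_CROSSJUMPING __attribute__((optimize("no-crossjumping")))
    #else
    #define PG_DISABLE_CROSSJUMPING
    #endif

    /*
     * GCC requires function attributes to appear ahead of the declarator
     * in a definition, so the macro has to precede the function name.
     */
    static long PG_DISABLE_CROSSJUMPING
    sum_steps(const long *operands, int nsteps)
    {
        long sum = 0;

        for (int i = 0; i < nsteps; i++)
            sum += operands[i];
        return sum;
    }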
Andres Freund <andres@anarazel.de> writes:
>> I personally find per-function annotation ala
>> __attribute__((optimize("no-crossjumping")))
>> cleaner anyway. I tested that, and it seems to work.
>>
>> Obviously we'd have to hide that behind a configure test. Could also do
>> tests based on __GNUC__ / __GNUC_MINOR__, but that seems uglier.

Agreed.

> Checking for this isn't entirely pretty - see my attached attempt at
> doing so. I considered hiding
> __attribute__((optimize("no-crossjumping"))) in execExprInterp.c behind
> a macro (like PG_DISABLE_CROSSJUMPING), but I don't really think that
> makes things better.

I think it would, primarily because if we find out that some other
compiler spells this differently, we could handle it totally within
configure.

Isn't our practice to put __attribute__ at the end of a function
declaration or definition, not in the middle someplace?

regards, tom lane
On 2017-03-27 09:33:43 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Checking for this isn't entirely pretty - see my attached attempt at
> > doing so. I considered hiding
> > __attribute__((optimize("no-crossjumping"))) in execExprInterp.c behind
> > a macro (like PG_DISABLE_CROSSJUMPING), but I don't really think that
> > makes things better.
>
> I think it would, primarily because if we find out that some other compiler
> spells this differently, we could handle it totally within configure.

I'm unconvinced that we could sensibly map different compiler's options
on the same option name - I'd be surprised if they would have a similar
enough effect.

> Isn't our practice to put __attribute__ at the end of a function
> declaration or definition, not in the middle someplace?

We could move it to the declaration - which doesn't seem like an
improvement here, given that it's about optimization not API changes -
but in definitions you can't move it after the function name:

    static Datum
    ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
    #ifdef HAVE_FUNCATTR_NO_CROSSJUMPING
    __attribute__((optimize("no-crossjumping")))
    #endif

:

/home/andres/src/postgresql/src/backend/executor/execExprInterp.c:279:1: error: attributes should be specified before the declarator in a function definition

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-27 09:33:43 -0400, Tom Lane wrote:
>> I think it would, primarily because if we find out that some other compiler
>> spells this differently, we could handle it totally within configure.

> I'm unconvinced that we could sensibly map different compiler's options
> on the same option name - I'd be surprised if they would have a similar
> enough effect.

[ shrug... ] OK, next question is whether pgindent treats this layout
sanely.

regards, tom lane
On 2017-03-27 11:22:40 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-27 09:33:43 -0400, Tom Lane wrote:
> >> I think it would, primarily because if we find out that some other compiler
> >> spells this differently, we could handle it totally within configure.
>
> > I'm unconvinced that we could sensibly map different compiler's options
> > on the same option name - I'd be surprised if they would have a similar
> > enough effect.
>
> [ shrug... ] OK, next question is whether pgindent treats this
> layout sanely.

The patch was run through pgindent, i.e. it seems to be ok with it.

- Andres
As to the point of whether it actually helps or not ...

on gcc 6.3.1 (Fedora 25), yes it does. Seems to be one "jmp *something"
per EEO_NEXT or EEO_JUMP.

on gcc 4.4.7 (RHEL 6), it makes things *WORSE*. We go from about half of
the dispatches getting routed through a common location, to *all* of them
(except one; for some odd reason the first EEO_NEXT in EEOP_NULLIF
survives as a separate jump). This seems like a bug, but there it is.

So this means we'd need some serious research to decide whether to apply
it. And I'm suspecting we'd end up with a compiler version test.

regards, tom lane
I wrote:
> As to the point of whether it actually helps or not ...
> on gcc 4.4.7 (RHEL 6), it makes things *WORSE*. We go from about half of
> the dispatches getting routed through a common location, to *all* of them
> (except one; for some odd reason the first EEO_NEXT in EEOP_NULLIF
> survives as a separate jump). This seems like a bug, but there it is.

So after a bit of googling, this is a very longstanding complaint:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39284
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785

(hm, think I know the second submitter)

I'm not sure we should be relying on a gcc bug fix that's barely four
months old. Even assuming it fixes the issue without new regressions,
most people are not going to have it anytime soon.

My feeling at this point is that we might be better off disabling the
computed-goto case by default. At the very least, we're going to need a
version check that restricts it to latest gcc.

regards, tom lane
On 2017-03-27 11:52:05 -0400, Tom Lane wrote:
> As to the point of whether it actually helps or not ...
>
> on gcc 6.3.1 (Fedora 25), yes it does. Seems to be one "jmp *something"
> per EEO_NEXT or EEO_JUMP.
>
> on gcc 4.4.7 (RHEL 6), it makes things *WORSE*. We go from about half of
> the dispatches getting routed through a common location, to *all* of them
> (except one; for some odd reason the first EEO_NEXT in EEOP_NULLIF
> survives as a separate jump). This seems like a bug, but there it is.

Gah :(. I have gcc 5, 6, 7 here, but nothing earlier. I'm now inclined
to go the version-check route, for the individual versions we can (and
want to) confirm this on... I'm not too concerned about not doing so on
gcc 4.4 or older...

> So this means we'd need some serious research to decide whether to apply
> it. And I'm suspecting we'd end up with a compiler version test.

Indeed.

- Andres
On 2017-03-27 12:18:37 -0400, Tom Lane wrote:
> I wrote:
> > As to the point of whether it actually helps or not ...
> > on gcc 4.4.7 (RHEL 6), it makes things *WORSE*. We go from about half of
> > the dispatches getting routed through a common location, to *all* of them
> > (except one; for some odd reason the first EEO_NEXT in EEOP_NULLIF
> > survives as a separate jump). This seems like a bug, but there it is.
>
> So after a bit of googling, this is a very longstanding complaint:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39284
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785
>
> (hm, think I know the second submitter)

I don't think that's precisely the same issue - as long as some of the
goto branches survive, we're not hitting the full brunt of the compgoto
thing. I think we're essentially avoiding some of that because we're
"precomputing" the dispatch_table lookup.

To count the number of jumps I used:

    gdb -batch -ex 'disassemble/s ExecInterpExpr' execExprInterp.o | grep jmpq | grep -v ExecInterpExpr | wc -l

which'll only include indirect jumps (since otherwise jumps will look
like "jmpq 0x35f2 <ExecInterpExpr+2066>").

                      standard flags    -fno-crossjumping
    gcc-5 (5.4.1-8):        34                  82
    gcc-6 (6.3.0-8):        34                  82
    gcc-7 (7.0.1):          71                 108
    gcc-snapshot:           72                 108

So that doesn't look too bad.

> My feeling at this point is that we might be better off disabling
> the computed-goto case by default. At the very least, we're going
> to need a version check that restricts it to latest gcc.

In my measurements it's still faster in at least gcc-5/6, even without
the option (largely because it avoids array bounds checks on the jump
table built for the switch).

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-27 12:18:37 -0400, Tom Lane wrote:
>> My feeling at this point is that we might be better off disabling
>> the computed-goto case by default. At the very least, we're going
>> to need a version check that restricts it to latest gcc.

> In my measurements it's still faster in at least gcc-5/6, even without
> the option (largely because it avoids array bounds checks on the jump
> table built for the switch).

Hm. What test cases are you using?

regards, tom lane
On 2017-03-27 13:30:11 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-27 12:18:37 -0400, Tom Lane wrote:
> >> My feeling at this point is that we might be better off disabling
> >> the computed-goto case by default. At the very least, we're going
> >> to need a version check that restricts it to latest gcc.
>
> > In my measurements it's still faster in at least gcc-5/6, even without
> > the option (largely because it avoids array bounds checks on the jump
> > table built for the switch).
>
> Hm. What test cases are you using?

I used tpc-h - seems like a realistic enough testcase.

Andres