Thread: Redesigning the executor (async, JIT, memory efficiency)
Hi,

The current executor structure limits us in a number of ways that are becoming problematic. I'll outline in this email how I think we should rework it, including a prototype for the first, most drastic, steps.

The primary problems with the current design that I want to address are:

1) The executor structure makes it hard to do asynchronous execution. That comes into play in several cases. For example, we want to be able to continue execution if no tuples are ready in one FDW below an append node, and instead switch to another one. Similarly, that could be relevant for parallel queries with multiple gather nodes.

   A bit further out, that'd also be very useful for continuing execution in a different part of the tree once blocked on IO. That'd e.g. be useful to build several hashtables for hashjoins at once. There are some limitations in how we do IO for that atm, however; it'd be easier to implement this efficiently if we had AIO support.

   With the current, tree-walking based architecture, it's hard to implement something like that, because we basically have to re-enter the subtrees where we'd stopped. That'd require adding new state-machine logic in every node, including additional branches.

   What we basically need here is the ability to stop execution inside a specific executor node and have a multiplexer node (which can use WaitEventSets or such) decide which part of the execution tree to continue executing instead.

2) The tree-walking based approach makes it very hard to JIT compile the main portion of query execution without manually re-emitting all of executor/, which is obviously entirely unattractive.

   For the expression evaluation JIT, the biggest benefit comes from removing the indirect branches between the expression steps and then, a close second, from having more efficient implementations of hot-path expression steps.

   That approach currently isn't feasible for the whole executor. The main sources of indirect branches / calls are the ExecProcNode() and ExecEvalExpr() calls inside the Exec*() nodes. To be able to make those constant we'd basically have to manually emit code for all of them (or attempt to rely on the inliner, but that doesn't work across deep, potentially self-invoking, trees). The expression code, which also used to be tree walking, avoids that now by having the indirect function calls nearly exclusively at the top level.

Secondary issues that I also want to address, or at least get ready to address, are:

3) Reduce the overhead of executor startup. In a number of workloads involving simple queries we currently spend most of the time within ExecInitNode(). There are two major problems here: For one, a lot of work is done at executor initialization that should be part of planning (especially creating tupledescs and nodeAgg initialization). The other problem is that initialization does a lot of tiny allocations.

4) Make the executor faster, leaving JIT aside. Moving to a "push based" executor can yield noticeable speedups: Right now returning a single tuple will often pass through a lot of indirect function calls (via ExecProcNode()). By moving to a push based executor (where the first executed piece is reachable without any recursion), this can be significantly cheaper. We also exercise the stack *a lot* right now, constantly pushing/popping stuff onto it that's not going to be used anytime soon. Profiling shows that stack usage is quite a bottleneck right now.

5) Reduce the memory overhead of query execution. Right now executing a prepared statement has quite a massive memory overhead. We copy around a lot of memory, initialize a lot of nodes from scratch, etc. It'd be far better if we could share a lot of state between successive executions of the same prepared statement.

6) Plan ahead for vectorized / batched execution. Dealing with tuples one-by-one is inefficient from a cache locality, code locality and CPU pipelining perspective.
Instead of fully processing a single tuple from a page (evaluating the qual, then joining it to other things etc.), it's a lot more efficient to evaluate the qual for all the tuples on a page, and then pass all the matching tuples to the next execution step.

After playing with / pondering various approaches, I think the right way to tackle this is to replace the tree-walk based execution with a 'program' / bytecode interpreter based approach. Quite similar to what we've done to expression evaluation in v10 (which then subsequently was used to implement JIT compilation of expression evaluation in v11).

To address the memory density / startup overhead issue I think we need to work towards two things: First, most of the memory needed for a query should come from a single, accurately sized, allocation. Secondly, the writable memory for query execution should be referenced using memory offsets, rather than pointers. That way we can clone the writable memory for a query execution.

That's obviously quite a bit of a project. Or rather, projects.

I've started prototyping a solution for this. The code for the prototype is at
https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=tree;h=refs/heads/executor-rewrite;hb=refs/heads/executor-rewrite
(click on the repo to see the git url) and I plan to continue working on that project in that git branch.

In my first iteration I'd tried to basically address 1-4 in a single prototype, using a single pass over the plan tree, without using the current executor initialization. Not impossible, but very, very unwieldy.

In the second version of the prototype I'm instead only tackling 1-2, and reusing the current executor node initialization functions. This prototype adds support for generating and executing an opcode based execution plan for a subset of query nodes (seqscan, index [only] scan, hash / sort based aggregation without grouping sets, limit, inner nestloop / hashjoin without batches) when use_linearized_plan is set and all nodes in the query are supported.

For example, this converts TPCH's Q01:

tpch_1[26142][1]=# SET client_min_messages = 'log';
tpch_1[26142][1]=# SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity) AS avg_qty,
    avg(l_extendedprice) AS avg_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM
    lineitem
WHERE
    l_shipdate <= date '1998-12-01' - interval '74 days'
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;
LOG: 00000: pp:

into:

0: init_sort
1: seqscan_first
2: seqscan [j empty 5] > s0
3: qual [j fail 2] < scan s0
4: hashagg_tuple [j 2] < s0
5: drain_hashagg [j empty 7] > s1
6: sort_tuple [j 5] < s1
7: sort
8: drain_sort [j empty 10] > s2
9: return < s2 [next 8]
10: done

where s are numbered slots, and j are, potentially conditional, jumps. This works for a number of plans, but there are several where we currently bail out.

The basic idea is to convert the current, largely unmodified, Plan tree into a "linear" representation of steps that can be executed by an interpreter similar to ExecInterpExpr(). Individual parts of a query tree form "instructions" which together form a small program to execute the query. Obviously such a program will contain (conditional) backward and forward jumps (which differentiates it from the expression case, which knows only forward jumps).

This means that some of the current executor nodes get split into multiple parts, because it's not permitted to recurse from within an executor step.
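To make the control flow concrete, here is a minimal, self-contained sketch of an interpreter in the style described above (all names are hypothetical, not the prototype's actual code): a flat array of steps, a virtual instruction pointer, and conditional backward/forward jumps in place of recursion. The "table" is just an array of ints, the qual keeps values greater than 2, and the "aggregate" sums them.

```c
#include <stddef.h>

/* Hypothetical opcodes, loosely modeled on the prototype's step names. */
typedef enum StepOp
{
    STEP_SEQSCAN,               /* fetch next value into the slot, jump when empty */
    STEP_QUAL,                  /* test the slot, jump back to the scan on failure */
    STEP_AGG,                   /* consume the slot, then jump back to the scan */
    STEP_DONE                   /* end of program */
} StepOp;

typedef struct ExecStep
{
    StepOp op;
    int    jumpto;              /* target instruction of this step's jump */
} ExecStep;

static int
exec_program(const ExecStep *program, const int *table, int ntuples)
{
    int ip = 0;                 /* virtual instruction pointer */
    int pos = 0;                /* scan position (writable execution state) */
    int slot = 0;               /* stand-in for a TupleTableSlot */
    int sum = 0;

    for (;;)
    {
        const ExecStep *step = &program[ip];

        switch (step->op)
        {
            case STEP_SEQSCAN:
                if (pos >= ntuples)
                {
                    ip = step->jumpto;  /* [j empty ...] forward jump */
                    continue;
                }
                slot = table[pos++];
                break;
            case STEP_QUAL:
                if (slot <= 2)
                {
                    ip = step->jumpto;  /* [j fail ...] backward jump */
                    continue;
                }
                break;
            case STEP_AGG:
                sum += slot;
                ip = step->jumpto;      /* unconditional backward jump */
                continue;
            case STEP_DONE:
                return sum;
        }
        ip++;                           /* fall through to the next step */
    }
}
```

A program mirroring steps 2-4 of the Q01 listing would be `{STEP_SEQSCAN, 3}, {STEP_QUAL, 0}, {STEP_AGG, 0}, {STEP_DONE, 0}`: no function recurses into another, so execution can stop and resume at any instruction boundary, which is exactly what the async and cursor cases need.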
A tuple can be returned to the user / caller by a special 'return tuple' step. This just sets the "virtual" instruction pointer to the point where execution should resume, allowing execution to continue at the right spot when the next tuple is supposed to be returned (e.g. for cursors). Instead of, as currently in the prototype, returning control flow back to ExecutePlan(), we could - and quite possibly should - instead just process the tuple by having a 'send_slot' step or such. But right now it's a good example of how control flow can be interrupted at any boundary.

For the async FDW (and gather, and IO, ...) cases, you'd have a 'scan_fdw' step that'd normally continue execution, but if there's no tuple, it'd instead jump to a multiplexer step that'd use a WaitEventSet to watch which of the to-be-waited-for FDWs is ready.

Despite being entirely unoptimized, this already yields a measurable speedup over the current executor for the TPCH queries it supports.

I think the big argument against this kind of project is that it's going to be quite invasive. Not everything inside the executor will have to be touched, but a significant portion will. But I don't think there's really a way around that for the medium to long term future, and I don't see how we'd gain anything by waiting further.

My basic plan of attack (although I certainly hope/plan to not work on this alone) is:

1) Expand the prototype to cover all SELECT queries. That's mostly "just work". This'll reuse the current executor initialization with minimal modifications.

2) Reimplement EXPLAIN ANALYZE support. I'm not yet 100% sure what the best way to do so is - potentially it's just generating slightly different programs, where the steps contain additional code to perform the instrumentation.

3) Reimplement the EvalPlanQual() infrastructure by generating a separate execution plan for EPQ, which replaces the relevant scan nodes with an execution step that returns tuples from estate->es_epqTuple. The current indirection / branches due to ExecScan() are fairly expensive.

4) Remove the old executor *Exec() functions, and subsidiary code.

5) Clean up the executor nodes; they should now require far less information, because a lot of the state will be in executor steps.

6) Introduce a new pass, run before ExecInitNode(), that accounts for all the necessary memory for, at least, nodes, slots and expression contexts. Then convert ExecInitNode() to allocate from that. I'd assume that we'd leave expression evaluation out of that, because such work would be a largely independent project.

7) Split the state in the executor nodes / executor program steps into modifiable and non-modifiable parts, and store as much as reasonably possible using relative addresses rather than absolute pointers. That can then be used to make executor programs reusable and reentrant. That's going to be useful for at least prepared statements, plpgsql and better JIT code generation.

8) Write a JIT implementation.

Thoughts?

Regards,

Andres
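As a toy illustration of the relative-addressing idea in point 7 (all names hypothetical, not from the prototype): if the read-only program refers to its writable state only by byte offsets from a base pointer, the writable part can be cloned with a single flat copy and both copies remain internally consistent, which is what makes reuse across prepared-statement executions plausible.

```c
#include <stdlib.h>
#include <string.h>

/* A read-only "step" that names its writable state by offset, not address. */
typedef struct CounterStep
{
    size_t counter_off;         /* byte offset of an int in the writable area */
} CounterStep;

/* Resolve an offset against a particular execution's writable base. */
static int *
step_counter(char *base, const CounterStep *step)
{
    return (int *) (base + step->counter_off);
}

static char *
wstate_create(size_t size)
{
    return calloc(1, size);     /* zeroed writable area, one allocation */
}

/* Because the program holds no absolute pointers into the writable area,
 * cloning it is a flat copy; the clone is immediately executable. */
static char *
wstate_clone(const char *src, size_t size)
{
    char *copy = malloc(size);

    memcpy(copy, src, size);
    return copy;
}
```

Had the step stored an absolute `int *`, the memcpy'd clone would still point into the original's memory; with offsets, each execution's base pointer keeps the two states fully independent.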
Cool stuff!

On 25/05/18 06:35, Andres Freund wrote:
> For example, this converts TPCH's Q01:
> [...]
> LOG: 00000: pp:
>
> into:
>
> 0: init_sort
> 1: seqscan_first
> 2: seqscan [j empty 5] > s0
> 3: qual [j fail 2] < scan s0
> 4: hashagg_tuple [j 2] < s0
> 5: drain_hashagg [j empty 7] > s1
> 6: sort_tuple [j 5] < s1
> 7: sort
> 8: drain_sort [j empty 10] > s2
> 9: return < s2 [next 8]
> 10: done
>
> where s are numbered slots, and j are, potentially conditional, jumps. This
> works for a number of plans, but there are several where we currently bail
> out.

How is this going to look in EXPLAIN, even without ANALYZE? Would it still show a tree plan, or this linear program, or both?

> Despite being entirely unoptimized, this already yields a measurable
> speedup over the current executor for the TPCH queries it supports.

Oh, that's impressive!

- Heikki
Hi,

On 2018-05-25 09:26:47 +0300, Heikki Linnakangas wrote:
> Cool stuff!

Thanks!

> On 25/05/18 06:35, Andres Freund wrote:
> > For example, this converts TPCH's Q01:
> > [...]
>
> How is this going to look in EXPLAIN, even without ANALYZE? Would it
> still show a tree plan, or this linear program, or both?

I think we're going to have to continue showing the tree plan. I think the linear version is far too hard to understand for anything nontrivial. I'd definitely continue to expect the Plan tree to be tree shaped, and I'm currently not foreseeing getting rid of PlanState trees entirely (albeit with smaller nodes, and potentially with relative rather than absolute pointers to children). So I don't really foresee a huge problem with continuing to display a tree shaped plan.
I'd expect there'd be an EXPLAIN option to show the linear plan, but that it'd not be used frequently outside of development work.

Greetings,

Andres Freund
On 05/24/2018 11:26 PM, Heikki Linnakangas wrote:
> Cool stuff!
>
> On 25/05/18 06:35, Andres Freund wrote:
>> For example, this converts TPCH's Q01:

+1 Wicked cool!

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
On Fri, May 25, 2018 at 2:40 AM, Andres Freund <andres@anarazel.de> wrote:
> I think we're going to have to continue showing the tree plan. I think
> the linear version is far too hard to understand for anything
> nontrivial.

Some of that is because this format is intrinsically hard to read, but it's also partly because of unfamiliarity and partly because you haven't put much work into it. You could certainly make it easier to read just by expending a little more effort on it:

0: Initialize Sort (for step 7)
1: Initialize Seq Scan on lineitem (for step 2)
2: Iterate Seq Scan on lineitem into slot 0
   If no more tuples, go to step 5.
3: Test slot 0 using <deparsed qual>
   If failed, go to step 2.
4: Load Hash Aggregate from slot 0. Go to step 2.
5: Retrieve from Hash Aggregate into slot 1.
   If no more tuples, go to step 7.
etc.

Even with this sort of TLC, I agree that reading this is no picnic, but I think continuing to show the tree structure when the executed plan has been linearized has its own set of difficulties. The way that time is reported today in EXPLAIN ANALYZE presumes that execution recurses down the tree; that's how times for higher nodes end up including times for subordinate nodes. If we time each step executed using this method, that will no longer be true. We could make it true. For example, we could make the Seq Scan time include steps 1, 2, and 3, the Hash Aggregate on top of it include that time plus steps 4 and 5, the Sort on top of that also include steps 0, 6, 7, and 8, etc. But that sucks. If we've got timing information for each step individually, that's awesome, and power users are going to want to see it. One of the most common questions one has when looking at today's EXPLAIN ANALYZE output is "where did the time go?", and we want to provide the most accurate possible answer to that question.
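One way to reconcile the two views (a hypothetical sketch, not anything from the prototype): tag each linear step with the id of the Plan node it was generated from, time steps individually, and let the tree-shaped EXPLAIN roll the step times up per node, so power users can still see the per-step numbers.

```c
/* Hypothetical instrumentation record: each linear step remembers which
 * Plan node it came from, and accumulates its own wall-clock time. */
typedef struct StepInstr
{
    int    plan_node_id;    /* originating Plan node */
    double total_ms;        /* time spent in this step */
} StepInstr;

/* Per-node time for the tree-shaped EXPLAIN output: the sum of the times
 * of all steps generated from that node. Unlike today's recursive timing,
 * a node's total does not include its children's steps. */
static double
plan_node_time(const StepInstr *steps, int nsteps, int plan_node_id)
{
    double sum = 0.0;

    for (int i = 0; i < nsteps; i++)
    {
        if (steps[i].plan_node_id == plan_node_id)
            sum += steps[i].total_ms;
    }
    return sum;
}
```

With Q01's listing, the Seq Scan node would own steps 1-3, the Hash Aggregate steps 4-5, and so on; the tree display then reports self-time per node rather than inclusive time.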
Also, consider a plan that today looks like this:

Append
  -> Seq Scan on p1
  -> Seq Scan on p2
  -> Seq Scan on p3

The linearized plan could omit the Append node altogether. We just went and made Append nodes have a cost, because they do, but in cases where the Append is neither parallel-aware, nor the new async-capable Append type we're presumably going to add, nor doing partition pruning, we can and probably should optimize it away completely. That will change which plan is fastest in some cases, I'm fairly sure. It doesn't keep you from reporting on the Append node as if it still existed, but if you do, you're sorta lying.

Another case that I think is kind of interesting is:

Nested Loop Semi Join
  -> Seq Scan
  -> Index Scan
       Index Cond: ...

If the index cond passes, we can emit the tuple returned by the Seq Scan and jump back to fetch the next tuple from the Seq Scan. Perhaps the Nested Loop doesn't really exist at all. If the join is something other than a semi/anti join then you probably need the join to project, unless all of the columns needed are from one side, but that's all, unless the join itself has a Join Filter.

I think users are going to want to see EXPLAIN ANALYZE output that makes it really clear how the query has been optimized. Sure, the existing format may be easier for users in the short term, and certainly can be an option, but seeing something that conceals rather than reveals how execution is really progressing may not be all that great in the end.

Maybe there are other ways of displaying the linearized plan that would increase comprehensibility further. For example, in your version of the plan for TPC-H Q01, steps 0-1 are initialization, 2-4 are a loop, 5-6 are a loop, 7 is an independent step, 8-9 are another loop, and 10 represents completion. Maybe that can be used somehow to make something easier to understand.
For example:

Initialize
  -> Initialize Sort 1
  -> Initialize Seq Scan on lineitem
Loop
  -> Iterate Seq Scan on lineitem
     Break if no more tuples
  -> Continue if ...
  -> Load Into Hash Aggregate 1
Loop
  -> Fetch From Hash Aggregate 1
  -> Load Into Sort 1
Perform Sort 1
Loop
  -> Fetch from Sort 1
  -> Return Tuple

Maybe that doesn't seem that much easier, but at least it doesn't involve needing to understand step numbers.

It strikes me that this kind of organization could potentially also be useful for a query progress indicator. If the whole plan is a series of loops and other steps, you can say things like: well, the steps we've already completed represent 30% of the estimated cost, and the steps we haven't started yet represent another 30% of the estimated cost, so we're somewhere between 30% done and 70% done. And then we can guess what fraction of the iterations we've completed for the loop we're currently performing. Obviously this is all pie-in-the-sky, but it certainly seems like this kind of thing would enable estimation that we can't easily perform today.

How will a linearized plan handle backward fetches?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
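The bracketing Robert describes above is simple arithmetic; a sketch (hypothetical names, assuming per-step cost estimates are available) that turns completed and not-yet-started cost into a [low, high] fraction-done interval:

```c
typedef struct ProgressBounds
{
    double lo;  /* fraction done if the current loop has just started */
    double hi;  /* fraction done if the current loop is nearly finished */
} ProgressBounds;

/* completed: estimated cost of steps already finished;
 * pending:   estimated cost of steps not yet started;
 * total:     estimated cost of the whole plan.
 * The remainder (total - completed - pending) is the currently running
 * portion, about which we only know that it lies between 0% and 100% done,
 * hence the interval. */
static ProgressBounds
progress_bounds(double completed, double pending, double total)
{
    ProgressBounds b;

    b.lo = completed / total;
    b.hi = (total - pending) / total;
    return b;
}
```

With Robert's numbers (30% completed, 30% not started), this yields the 30%-70% interval; an estimate of the current loop's iteration fraction could then narrow it further.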
On 2018-05-26 17:08:51 -0400, Robert Haas wrote:
> On Fri, May 25, 2018 at 2:40 AM, Andres Freund <andres@anarazel.de> wrote:
> > I think we're going to have to continue showing the tree plan. I think
> > the linear version is far too hard to understand for anything
> > nontrivial.
>
> Some of that is because this format is intrinsically hard to read, but
> it's also partly because of unfamiliarity and partly because you
> haven't put much work into it. You could certainly make it easier to
> read just by expending a little more effort on it:

Right - it didn't seem like the highest priority thing to tackle. But without displaying hierarchy, even a lot of TLC won't make it easy to understand. Being able to quickly ignore, or even hide if you use something like explain.depesz.com, entire subtrees is quite valuable. We don't compose programs out of one function plus a lot of goto statements for a reason...

> Even with this sort of TLC, I agree that reading this is no picnic,
> but I think continuing to show the tree structure when the executing
> plan has been linearized seems to have its own set of difficulties.

I'm not so sure. The plan would still be in tree form, and I do think people realistically do, and should, care more about the planner's choices than about the minute details of how things are actually executed. Especially if we're going to whack those details around repeatedly at a granular level.

> Also, consider a plan that today looks like this:
>
> Append
>   -> Seq Scan on p1
>   -> Seq Scan on p2
>   -> Seq Scan on p3
>
> The linearized plan could omit the Append node altogether.

Sure, but instrumentation nodes could continue to have a reference to the Plan/Append node as their parent.

> How will a linearized plan handle backward fetches?

I don't quite know yet, TBH. I'm playing with a few alternatives, including:

a) Don't support it in general; require materialization at the top level. This has the advantage of minimal complexity that's only incurred when actually needed.
And I think scrolling backwards is extremely rare in practice.

b) Emit a separate execution program the first time a different direction is required, but share enough state that the program can just be switched over. That'd allow removing all the branches needed to support backward scans from the hot paths.

c) Continue essentially as currently done.

Greetings,

Andres Freund
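Option b) could look roughly like this (entirely hypothetical names): keep one compiled program per scan direction, build the second lazily on first use, and let both point at the same shared writable state so that switching direction doesn't lose position.

```c
#include <stdlib.h>

typedef enum ScanDirection { DIR_FORWARD = 0, DIR_BACKWARD = 1 } ScanDirection;

typedef struct ExecProgram
{
    ScanDirection dir;          /* placeholder for direction-specific steps */
} ExecProgram;

typedef struct QueryExec
{
    ExecProgram *programs[2];   /* one per direction, built on demand */
    int          shared_pos;    /* cursor state shared by both programs */
} QueryExec;

/* Return the program for the requested direction, compiling it the first
 * time that direction is used. Each program is direction-specific, so its
 * hot path carries no per-tuple branches on scan direction. */
static ExecProgram *
get_program(QueryExec *qe, ScanDirection dir)
{
    if (qe->programs[dir] == NULL)
    {
        ExecProgram *p = malloc(sizeof(ExecProgram));

        p->dir = dir;
        qe->programs[dir] = p;
    }
    return qe->programs[dir];
}
```

A cursor that never fetches backwards pays only for the forward program; the backward variant is compiled once, on the first FETCH BACKWARD, and reused thereafter.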
> I think we're going to have to continue showing the tree plan. I think
> the linear version is far too hard to understand for anything
> nontrivial.
Hey Andres, what you're pitching here is very similar to the way DB2 works. It actually has two different explain tools that dump the two different formats. The tree dump is the preferred format, but there are times when the human-readable dump of the executor steps is very useful. Particularly, as mentioned elsewhere in this thread, when the executor steps diverge from the plan.
One thing that helps correlate the two formats is that each node in the tree dump is numbered and each step in the executor is annotated with the corresponding tree node number.
Another tool that was invaluable is a readable dump (effectively a disassembly) of the byte code. It was only useful to developers, but when something went wrong in the executor it was incredibly useful to see exactly what the byte code was specifying (as opposed to the human-readable format of explain, which could hide subtle details).
- Doug
Salesforce
Thanks! I've been waiting for this.

At Thu, 24 May 2018 20:35:39 -0700, Andres Freund <andres@anarazel.de> wrote in <20180525033538.6ypfwcqcxce6zkjj@alap3.anarazel.de>
> The current executor structure limits us in a number of ways that are
> becoming problematic. I'll outline in this email how I think we should
> rework it. Including a prototype for the first, most drastic, steps.
>
> The primary problems with the current design that I want to address are:
>
> 1) Executor structure makes it hard to do asynchronous execution.
> [...]
> What we basically need here is the ability to stop execution inside a
> specific executor node and have a multiplexer node (which can use
> WaitEventSets or such) decide which part of the execution tree to
> continue executing instead.

Yeah. Finally, async execution requires that capability.

> 2) The tree-walking based approach makes it very hard to JIT compile the
> main portion of query execution without manually re-emitting all of
> executor/, which is obviously entirely unattractive.
>
> [...] The expression code,
> which also used to be tree walking, avoids that now by having the
> indirect function calls nearly exclusively at the top-level.

I looked into this a bit. One of the hardest points in flattening the executor (in my previous attempt) *was* turning the Exec* functions inside out, especially Agg/WindowAgg. The heart of the difficulty was that the result could be utterly unreadable if *I* reconstructed, say, WindowAgg or non-hash Agg into functions that are called with tuples from the underlying nodes. In the repo, nodeAgg is flattened only for the hash-agg case, so it seems rather straightforward, but I'm afraid such complex nodes become collections of mystic fragments that are expected to work when called in a sequence written like a magic spell cast by the compiler part. (Sorry in advance for the maybe hard-to-read blob.)

One related concern is that in the end the "mystic fragments" are no longer node-private but global ones called from ExecExecProgram. JIT'ing is essentially the same kind of thing, and in the current ExecExpr case it seems to me tolerably small. However, I'm not sure it is maintainable (for moderate developers) if all such fragments are connected via ExecExecProgram, following a magic spell.

> Secondary issues that I also want to address, or at least get ready to
> address, are:
>
> 3) Reduce the overhead of executor startup.
> In a number of workloads involving simple queries we currently spend
> most of the time within ExecInitNode(). There's two major problems
> here: For one a lot of work is done at executor initialization that
> should be part of planning (especially creating tupledescs and nodeAgg
> initialization). The other problem is that initialization does a lot
> of tiny allocations.
>
> 4) Make the executor faster, leaving JIT aside.
> [...]
>
> 5) Memory overhead of query execution.
> [...]
>
> 6) Plan ahead for vectorized / batched execution.
> [...]
>
> After playing with / pondering about various approaches, I think the
> right way to approach this is to convert the tree-walk based execution
> with a 'program' / bytecode interpreter based approach. Quite similar to
> what we've done to expression evaluation in v10 (which then subsequently
> was used to implement JIT compilation of expression evaluation in v11).

I agree with the direction overall, but I have a concern about maintainability, as above.
> For the async FDW (and gather, and IO, ...) cases, you'd have a
> 'scan_fdw' step that'd normally continue execution, but if there's no
> tuple, it'd instead jump to a multiplexer step that'd use a WaitEventSet
> to watch which of the to-be-waited-for FDWs is ready.

In the end, the executor will become push-based, where all "source" nodes drive execution - not only FDW nodes but all source nodes, including local scans, I suppose.

> [...]
>
> Thoughts?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center