Thread: Redesigning the executor (async, JIT, memory efficiency)

Redesigning the executor (async, JIT, memory efficiency)

From
Andres Freund
Date:
Hi,


The current executor structure limits us in a number of ways that are
becoming problematic.  I'll outline in this email how I think we should
rework it, including a prototype for the first, most drastic, steps.

The primary problems with the current design that I want to address are:

1) Executor structure makes it hard to do asynchronous execution.  That
   comes into play in several cases. For example, if no tuples are ready
   in one FDW below an append node, we want to be able to continue
   execution by switching to another one. Similarly, that could be
   relevant for parallel queries with multiple gather nodes.

   A bit further out that'd also be very useful for continuing execution
   in a different part of the tree once blocked on IO. That'd e.g. be
   useful to build several hashtables for hashjoins at once.  There are
   some limitations in how we do IO that make that hard at the moment;
   it'd be easier to implement this efficiently if we had AIO support.

   With the current, tree-walking based architecture, it's hard to
   implement something like that, because we basically have to re-enter
   the subtrees where we'd stopped. That'd require adding new
   state-machine logic to every node, including additional branches.

   What we basically need here is the ability to stop execution inside a
   specific executor node and have a multiplexer node (which can use
   WaitEventSets or such) decide which part of the execution tree to
   continue executing instead.
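
   A hypothetical sketch of such a multiplexer step (nothing here is
   actual PostgreSQL code; AsyncSource, multiplex, and the resume_step
   field are all invented for illustration). Each async-capable subtree
   reports whether a tuple is ready, and the multiplexer picks one that
   is, instead of blocking on the first; a real implementation would
   sleep on a WaitEventSet rather than poll flags:

```c
#include <stdbool.h>

/* One async-capable subtree, e.g. an FDW scan below an Append. */
typedef struct AsyncSource
{
    bool ready;             /* would come from a WaitEventSet in reality */
    int  resume_step;       /* program step to jump to for this subtree */
} AsyncSource;

/*
 * Multiplexer step: return the program step of some ready source, or -1
 * if none is ready yet (the real thing would wait instead of returning).
 */
static int
multiplex(const AsyncSource *sources, int nsources)
{
    for (int i = 0; i < nsources; i++)
        if (sources[i].ready)
            return sources[i].resume_step;
    return -1;
}
```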


2) The tree-walking based approach makes it very hard to JIT compile the
   main portion of query execution without manually re-emitting all of
   executor/, which is obviously entirely unattractive.

   For the expression evaluation JIT the biggest benefit comes from
   removing the indirect branches between the expression steps and then,
   a close second, from having more efficient implementations of
   hot-path expression steps.

   That approach currently isn't feasible for the whole executor. The
   main sources of indirect branches / calls are the ExecProcNode() and
   ExecEvalExpr() calls inside the Exec*() nodes. To be able to make
   those constant we'd basically have to manually emit code for all of
   these (or attempt to rely on the inliner, but that doesn't work
   across deep, potentially self-invoking, trees).  The expression code,
   which also used to be tree walking, avoids that now by having the
   indirect function calls nearly exclusively at the top level.
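
   To illustrate the difference (purely a sketch on toy opcodes, not
   executor code): with a flat program, the only control transfer left
   is the dispatch at the top of a single interpreter loop, which a JIT
   can turn into direct jumps, instead of indirect calls scattered
   through every node function:

```c
/* Opcodes of a toy linear program; illustrative names only. */
typedef enum { OP_LOAD, OP_ADD, OP_MUL, OP_DONE } OpCode;

typedef struct
{
    OpCode op;
    int    arg;
} Step;

/*
 * The interpreter: one dispatch point at the top of the loop.  A JIT
 * can replace this switch with direct jumps, which is much harder when
 * control transfers are indirect calls spread across many functions.
 */
static int
exec_program(const Step *steps)
{
    int acc = 0;

    for (const Step *ip = steps;; ip++)
    {
        switch (ip->op)
        {
            case OP_LOAD: acc = ip->arg; break;
            case OP_ADD:  acc += ip->arg; break;
            case OP_MUL:  acc *= ip->arg; break;
            case OP_DONE: return acc;
        }
    }
}
```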


Secondary issues that I also want to address, or at least get ready to
address, are:


3) Reduce the overhead of executor startup. In a number of workloads
   involving simple queries we currently spend most of the time within
   ExecInitNode().  There are two major problems here: for one, a lot of
   work is done at executor initialization that should be part of
   planning (especially creating tupledescs and nodeAgg
   initialization). The other problem is that initialization does a lot
   of tiny allocations.


4) Make the executor faster, leaving JIT aside.

   Moving to a "push based" executor can yield noticeable speedups: Right
   now returning a single tuple will often pass through a lot of
   indirect function calls (via ExecProcNode()).  By moving to a push
   based executor (where the first executed piece is reachable without
   any recursion), this can be significantly cheaper.  We also exercise
   the stack *a lot* right now, constantly pushing/popping stuff onto it
   that's not going to be used anytime soon.  Profiling shows that stack
   usage is quite a bottleneck right now.


5) Memory overhead of query execution.

   Right now executing a prepared statement has quite a massive memory
   overhead. We copy around a lot of memory, initialize a lot of nodes
   from scratch, etc. It'd be far better if we could share a lot of
   state between successive executions of the same prepared statement.


6) Plan ahead for vectorized / batched execution.

   Dealing with tuples one-by-one is inefficient from a cache locality,
   code locality and CPU pipelining perspective. Instead of fully
   processing a single tuple from a page (evaluating its qual, then
   joining it to other things, etc.) before moving on to the next, it's
   a lot more efficient to evaluate the qual for all the tuples on a
   page, and then pass all the matching tuples to the next execution
   step.
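
   A sketch of the batched style (an invented example on plain ints, not
   executor code): the qual runs as one tight loop over the whole
   "page", and only the survivors are handed downstream:

```c
/*
 * Batched qual evaluation, sketched on plain ints: one tight loop over
 * the whole batch before anything downstream sees a tuple.  Better
 * cache/code locality, and easy for the compiler/CPU to pipeline or
 * vectorize.  Returns the number of matching values.
 */
static int
filter_batch(const int *vals, int nvals, int limit, int *matches)
{
    int nmatches = 0;

    for (int i = 0; i < nvals; i++)
        if (vals[i] <= limit)       /* the "qual" */
            matches[nmatches++] = vals[i];
    return nmatches;
}
```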


After playing with / pondering about various approaches, I think the
right way to approach this is to replace the tree-walk based execution
with a 'program' / bytecode interpreter based approach. Quite similar to
what we've done to expression evaluation in v10 (which then subsequently
was used to implement JIT compilation of expression evaluation in v11).

To address the memory density / startup overhead issue I think we need
to work towards two things: First, most of the memory needed for a query
should come from a single, accurately sized, allocation. Secondly, the
writable memory for query execution should be referenced using memory
offsets, rather than pointers. That way we can clone the writable memory
for query execution.
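
As a sketch of the offset idea (invented layout and names, not the
prototype's actual structures): if internal references are offsets from
the start of one accurately sized block, cloning the writable state for
another execution is a single memcpy:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * All writable execution state lives in one block; internal references
 * are offsets from the block's start rather than pointers, so the copy
 * needs no pointer fixups.
 */
typedef struct
{
    uint32_t counter_off;   /* offset of a writable counter in the block */
} StateHeader;

static int *
state_counter(char *base)
{
    return (int *) (base + ((StateHeader *) base)->counter_off);
}

static char *
clone_state(const char *src, size_t size)
{
    char *copy = malloc(size);

    memcpy(copy, src, size);    /* offsets remain valid in the copy */
    return copy;
}
```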


That's obviously quite a bit of a project.  Or rather projects.


I've started prototyping a solution for this.  The code for the
prototype is at

https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=tree;h=refs/heads/executor-rewrite;hb=refs/heads/executor-rewrite
(click on the repo to see the git url)
and I plan to continue working on that project in that git branch.


In my first iteration I'd tried to basically address 1-4 in a single
prototype, using a single pass over the plan tree, without using the
current executor initialization. Not impossible, but very, very unwieldy.

In the second version of the prototype I'm instead only tackling 1-2,
and reusing the current executor node initialization functions. This
prototype adds support for generating and executing an opcode based
execution plan for a subset of query nodes (seqscan, index [only] scan,
hash / sort based aggregation without grouping sets, limit, inner
nestloop / hashjoin without batches) when use_linearized_plan is set
and all nodes in the query are supported.

For example, this converts TPCH's Q01:

tpch_1[26142][1]=# SET client_min_messages ='log';
tpch_1[26142][1]=# SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity) AS avg_qty,
    avg(l_extendedprice) AS avg_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM
    lineitem
WHERE
    l_shipdate <= date '1998-12-01' - interval '74 days'
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;
LOG:  00000: pp:

into:

0: init_sort
1: seqscan_first
2: seqscan [j empty 5] > s0
3: qual [j fail 2] < scan s0
4: hashagg_tuple [j 2] < s0
5: drain_hashagg [j empty 7] > s1
6: sort_tuple [j 5] < s1
7: sort
8: drain_sort [j empty 10] > s2
9: return < s2 [next 8]
10: done

where the s's are numbered slots and the j's are, potentially
conditional, jumps. This works for a number of plans, but there are
several where we currently bail out.

The basic idea is to convert the current, largely unmodified, Plan tree
into a "linear" representation of steps that can be executed by an
interpreter similar to ExecInterpExpr().  Individual parts of a query
tree form "instructions" which together form a small program to execute
the query.  Obviously such a program will contain (conditional) backward
and forward jumps (which differentiates it from the expression case,
which knows only forward jumps).

This means that some of the current executor nodes get split into
multiple parts, because it's not permitted to recurse from within an
executor step.

A tuple can be returned to the user / caller by a special 'return tuple'
step. This just sets the "virtual" instruction pointer to the point
where execution should resume, allowing execution to continue at the
right spot when the next tuple is supposed to be returned (e.g. for
cursors). Instead of, as currently in the prototype, returning control
flow back to ExecutePlan() we could - and quite possibly should -
instead just process the tuple by having a 'send_slot' step or such.
But right now it's a good example of how control flow can be interrupted
at any step boundary.
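
A minimal sketch of that resumption mechanism (invented types, not the
prototype's actual code): a 'return tuple' step saves the virtual
instruction pointer in the execution state, and the next call into the
interpreter simply continues there:

```c
/* Toy resumable program: each 'return' step yields one value per call. */
typedef enum { STEP_RETURN, STEP_JUMP, STEP_DONE } StepKind;

typedef struct
{
    StepKind kind;
    int      value;         /* for STEP_RETURN */
    int      target;        /* for STEP_JUMP */
} ProgStep;

typedef struct
{
    int ip;                 /* virtual instruction pointer, kept between calls */
} ProgState;

/* Run until a value is produced; return -1 once the program is done. */
static int
next_value(ProgState *state, const ProgStep *prog)
{
    for (;;)
    {
        const ProgStep *step = &prog[state->ip];

        switch (step->kind)
        {
            case STEP_RETURN:
                state->ip++;            /* resume after this step next time */
                return step->value;
            case STEP_JUMP:
                state->ip = step->target;
                break;
            case STEP_DONE:
                return -1;
        }
    }
}
```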

For the async FDW (and gather, and IO, ...) cases, you'd have a
'scan_fdw' step that'd normally continue execution, but if there's no
tuple, it'd instead jump to a multiplexer step that'd use a WaitEventSet
to watch which of the to-be-waited-for FDWs is ready.


Despite being entirely unoptimized this already yields a measurable
speedup over the current executor for the TPCH queries it supports.


I think the big argument against this kind of project is that it's going
to be quite invasive. Not everything inside the executor will have to be
touched, but it's going to be a significant portion. But I don't think
there's really a way around that for the medium to long term future, and
I don't see how we'd gain anything by waiting further.


My basic plan of attack (although I certainly hope/plan to not work on
this alone) is:

1) Expand prototype to cover all SELECT queries. That's mostly "just
   work". This'll reuse the current executor initialization with minimal
   modifications.

2) Reimplement EXPLAIN ANALYZE support. I'm not yet 100% sure what the
   best way to do so is - potentially it's just generating slightly
   different programs, where the steps contain additional code to
   perform the instrumentation.

3) Reimplement EvalPlanQual() infrastructure by generating a separate
   execution plan for EPQ, which replaces the relevant scan nodes with
   an execution step that returns tuples from estate->es_epqTuple.  The
   current indirection / branches due to ExecScan() are fairly
   expensive.

4) Remove old executor *Exec() functions, and subsidiary code.

5) Clean up executor nodes, they should now require far less
   information, because a lot of the state will be in executor steps.

6) Introduce a new pass, run before ExecInitNode(), that accounts for
   all the necessary memory for, at least, nodes, slots, expression
   contexts.  Then convert ExecInitNode() to allocate from that.  I'd
   assume that we'd leave expression evaluation out of that, because
   such work would be a largely independent project.

7) Split state in the executor nodes / executor program steps into
   modifiable and non-modifiable parts, and store as much as reasonably
   possible using relative addresses rather than absolute pointers.
   That can then be used to make executor programs reusable and
   reentrant.  That's going to be useful for at least prepared
   statements, plpgsql, and better JIT code generation.

8) Write a JIT implementation.


Thoughts?


Regards,

Andres


Re: Redesigning the executor (async, JIT, memory efficiency)

From
Heikki Linnakangas
Date:
Cool stuff!

On 25/05/18 06:35, Andres Freund wrote:
> For example, this converts this converts TPCH's Q01:
> 
> tpch_1[26142][1]=# SET client_min_messages ='log';
> tpch_1[26142][1]=# SELECT
>     l_returnflag,
>     l_linestatus,
>     sum(l_quantity) AS sum_qty,
>     sum(l_extendedprice) AS sum_base_price,
>     sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
>     sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
>     avg(l_quantity) AS avg_qty,
>     avg(l_extendedprice) AS avg_price,
>     avg(l_discount) AS avg_disc,
>     count(*) AS count_order
> FROM
>     lineitem
> WHERE
>     l_shipdate <= date '1998-12-01' - interval '74 days'
> GROUP BY
>     l_returnflag,
>     l_linestatus
> ORDER BY
>     l_returnflag,
>     l_linestatus;
> LOG:  00000: pp:
> 
> into:
> 
> 0: init_sort
> 1: seqscan_first
> 2: seqscan [j empty 5] > s0
> 3: qual [j fail 2] < scan s0
> 4: hashagg_tuple [j 2] < s0
> 5: drain_hashagg [j empty 7] > s1
> 6: sort_tuple [j 5] < s1
> 7: sort
> 8: drain_sort [j empty 10] > s2
> 9: return < s2 [next 8]
> 10: done
> 
> where s are numbered slots, j are, potentially conditional, jumps. This
> works for a number of plans, but there's several where we currently bail
> out.

What is this going to look like in EXPLAIN, even without ANALYZE? Would
it still show a tree plan, or this linear program, or both?

> Despite being entirely unoptimized this already yields a measurable
> speedup over the current executor for the TPCH queries it supports.

Oh, that's impressive!

- Heikki


Re: Redesigning the executor (async, JIT, memory efficiency)

From
Andres Freund
Date:
Hi,

On 2018-05-25 09:26:47 +0300, Heikki Linnakangas wrote:
> Cool stuff!

Thanks!

> On 25/05/18 06:35, Andres Freund wrote:
> > For example, this converts this converts TPCH's Q01:
> > 
> > tpch_1[26142][1]=# SET client_min_messages ='log';
> > tpch_1[26142][1]=# SELECT
> >     l_returnflag,
> >     l_linestatus,
> >     sum(l_quantity) AS sum_qty,
> >     sum(l_extendedprice) AS sum_base_price,
> >     sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
> >     sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
> >     avg(l_quantity) AS avg_qty,
> >     avg(l_extendedprice) AS avg_price,
> >     avg(l_discount) AS avg_disc,
> >     count(*) AS count_order
> > FROM
> >     lineitem
> > WHERE
> >     l_shipdate <= date '1998-12-01' - interval '74 days'
> > GROUP BY
> >     l_returnflag,
> >     l_linestatus
> > ORDER BY
> >     l_returnflag,
> >     l_linestatus;
> > LOG:  00000: pp:
> > 
> > into:
> > 
> > 0: init_sort
> > 1: seqscan_first
> > 2: seqscan [j empty 5] > s0
> > 3: qual [j fail 2] < scan s0
> > 4: hashagg_tuple [j 2] < s0
> > 5: drain_hashagg [j empty 7] > s1
> > 6: sort_tuple [j 5] < s1
> > 7: sort
> > 8: drain_sort [j empty 10] > s2
> > 9: return < s2 [next 8]
> > 10: done
> > 
> > where s are numbered slots, j are, potentially conditional, jumps. This
> > works for a number of plans, but there's several where we currently bail
> > out.
> 
> How is this going to look like in EXPLAIN, even without ANALYZE? Would it
> still show a tree plan, or this linear program, or both?

I think we're going to have to continue showing the tree plan. I think
the linear version is far too hard to understand for anything
nontrivial.  I'd definitely continue to expect the Plan tree to be tree
shaped, and I currently don't foresee getting rid of PlanState trees
entirely (albeit with smaller nodes, and potentially with relative
rather than absolute pointers to children).  So I don't really foresee a
huge problem continuing to display a tree shaped plan.

I'd expect there'd be an EXPLAIN option to show the linear plan, but
that that'd not be used frequently outside of development work.

Greetings,

Andres Freund


Re: Redesigning the executor (async, JIT, memory efficiency)

From
Joe Conway
Date:
On 05/24/2018 11:26 PM, Heikki Linnakangas wrote:
> Cool stuff!
>
> On 25/05/18 06:35, Andres Freund wrote:
>> For example, this converts this converts TPCH's Q01:

+1
Wicked cool!


--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Re: Redesigning the executor (async, JIT, memory efficiency)

From
Robert Haas
Date:
On Fri, May 25, 2018 at 2:40 AM, Andres Freund <andres@anarazel.de> wrote:
> I think we're going to have to continue showing the tree plan. I think
> the linear version is far too hard to understand for anything
> nontrivial.

Some of that is because this format is intrinsically hard to read, but
it's also partly because of unfamiliarity and partly because you
haven't put much work into it.  You could certainly make it easier to
read just by expending a little more effort on it:

0: Initialize Sort (for step 7)
1: Initialize Seq Scan on lineitem (for step 2)
2: Iterate Seq Scan on lineitem into slot 0
    If no more tuples, go to step 5.
3: Test slot 0 using <deparsed qual>
    If failed, go to step 2.
4: Load Hash Aggregate from slot 0.
    Go to step 2.
5: Retrieve from Hash Aggregate into slot 1.
    If no more tuples, go to step 7.
etc.

Even with this sort of TLC, I agree that reading this is no picnic,
but I think continuing to show the tree structure when the executing
plan has been linearized seems to have its own set of difficulties.
The way that time is reported today in EXPLAIN ANALYZE presumes that
execute recurses down the tree; that's how times for higher nodes end
up including times for subordinate nodes.  If we time each step
executed using this method, that will no longer be true.  We could
make it true.  For example we could make the Seq Scan time include
steps 1, 2, and 3, the Hash Aggregate on top of it include that time
plus steps 4 and 5, the Sort on top of that also include steps 0, 6,
7, and 8, etc.  But that sucks.  If we've got timing information for
each step individually, that's awesome, and power users are going to
want to see it.  One of the most common questions one has when looking
at today's EXPLAIN ANALYZE output is "where did the time go?" and we
want to provide the most accurate possible answer to that question.

Also, consider a plan that today looks like this:

Append
-> Seq Scan on p1
-> Seq Scan on p2
-> Seq Scan on p3

The linearized plan could omit the Append node altogether.  We just
went and made Append nodes have a cost, because they do have one; but in
cases where the Append is neither parallel-aware, nor the new
async-capable Append type we're presumably going to add, nor doing
partition pruning, we can and probably should optimize it away
completely.  That will
change which plan is fastest in some cases, I'm fairly sure.  It
doesn't keep you from reporting on the Append node as if it still
existed, but if you do, you're sorta lying.  Another case that I think
is kind of interesting is:

Nested Loop Semi Join
-> Seq Scan
-> Index Scan
    Index Cond: ...

If the index cond passes, we can emit the tuple returned by the Seq
Scan and jump back to fetch the next tuple from the Seq Scan.  Perhaps
the Nested Loop doesn't really exist at all.  If the join is something
other than a semi/anti join then you probably need the join to
project, unless all of the columns needed are from one side, but
that's all, unless the join itself has a Join Filter.  I think users
are going to want to see EXPLAIN ANALYZE output that makes it really
clear how the query has been optimized.  Sure, the existing format may
be easier for users in the short term, and certainly can be an option,
but seeing something that conceals rather than reveals how the
execution is really progressing may not be all that great in the end.

Maybe there are other ways of displaying the linearized plan that
would increase comprehensibility further.  For example, in your
version of the plan for TPC-H Q01, steps 0-1 are initialization, 2-4
are a loop, 5-6 are a loop, 7 is an independent step, 8-9 are another
loop, and 10 represents completion.  Maybe that can be used somehow to
make something easier to understand.  For example:

Initialize
-> Initialize Sort 1
-> Initialize Seq Scan on lineitem
Loop
-> Iterate Seq Scan on lineitem
    Break if no more tuples
-> Continue if ...
-> Load Into Hash Aggregate 1
Loop
-> Fetch From Hash Aggregate 1
-> Load Into Sort 1
Perform Sort 1
Loop
-> Fetch from Sort 1
-> Return Tuple

Maybe that doesn't seem that much easier, but at least it doesn't
involve needing to understand step numbers.  It strikes me that this
kind of organization could potentially also be useful for a query
progress indicator.  If the whole plan is a series of loops and other
steps, you can say things like -- well, the steps we've already
completed represent 30% of the estimated cost.  The steps we haven't
started yet represent another 30% of the estimated cost.  So we're
somewhere between 30% done and 70% done.  And then we can guess what
fraction of the iterations we've completed for the loop we're
currently performing.  Obviously this is all pie-in-the-sky, but it
certainly seems like this kind of thing would enable estimation that
we can't easily perform today.

How will a linearized plan handle backward fetches?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Redesigning the executor (async, JIT, memory efficiency)

From
Andres Freund
Date:
On 2018-05-26 17:08:51 -0400, Robert Haas wrote:
> On Fri, May 25, 2018 at 2:40 AM, Andres Freund <andres@anarazel.de> wrote:
> > I think we're going to have to continue showing the tree plan. I think
> > the linear version is far too hard to understand for anything
> > nontrivial.
>
> Some of that is because this format is intrinsically hard to read, but
> it's also partly because of unfamiliarity and partly because you
> haven't put much work into it.  You could certainly make it easier to
> read just by expending a little more effort on it:

Right - didn't seem like the highest priority thing to tackle. But
without displaying hierarchy even a lot of TLC won't make it easy to
understand.  Being able to quickly ignore, or even hide (if you use
something like explain.depesz.com), entire subtrees is quite valuable.
We don't compose programs in one function + a lot of goto statements for
a reason...


> Even with this sort of TLC, I agree that reading this is no picnic,
> but I think continuing to show the tree structure when the executing
> plan has been linearized seems to have its own set of difficulties.

I'm not so sure. The plan would still be in tree form, and I think
realistically people do, and should, care more about the planner's
choices than about the minute details of how things are actually
executed.  Especially if we're going to whack those details around at a
granular level repeatedly.


> Also, consider a plan that today looks like this:
>
> Append
> -> Seq Scan on p1
> -> Seq Scan on p2
> -> Seq Scan on p3
>
> The linearized plan could omit the Append node altogether.

Sure, but instrumentation nodes could continue to have a reference to
the Plan/Append node as their parent.


> How will a linearized plan handle handle backward fetches?

I don't quite know yet, TBH.  I'm playing with a few alternatives,
including:

a) Don't support it in general, require materialization at the top
   level. Has the advantage of minimal complexity that's only incurred
   when actually needed. And I think scrolling backwards is extremely
   rare in practice.
b) Emit a separate execution program the first time a different
   direction is required, but share enough state that the program can
   just be switched over.  That'd allow removing all the branches needed
   to support backward scans from the hot paths.
c) Continue as currently done, essentially.

Greetings,

Andres Freund


Re: Redesigning the executor (async, JIT, memory efficiency)

From
Douglas Doole
Date:
> I think we're going to have to continue showing the tree plan. I think
> the linear version is far too hard to understand for anything
> nontrivial.

Hey Andres, what you're pitching here is very similar to the way DB2 works. It actually has two different explain tools that dump the two different formats. The tree dump is the preferred format, but there are times when the human-readable dump of the executor steps is very useful. Particularly, as mentioned elsewhere in this thread, when the executor steps diverge from the plan.

One thing that helps correlate the two formats is that each node in the tree dump is numbered and each step in the executor is annotated with the corresponding tree node number.

Another tool that was invaluable is a readable dump (effectively a disassembly) of the byte code. It was only useful to developers, but when something went wrong in the executor it was incredibly useful to see exactly what the byte code was specifying (as opposed to the human-readable format of explain, which could hide subtle details).

- Doug
Salesforce

Re: Redesigning the executor (async, JIT, memory efficiency)

From
Kyotaro HORIGUCHI
Date:
Thanks! I've been waiting for this.

At Thu, 24 May 2018 20:35:39 -0700, Andres Freund <andres@anarazel.de> wrote in
<20180525033538.6ypfwcqcxce6zkjj@alap3.anarazel.de>
> The current executor structure limits us in a number of ways that are
> becoming problematic.  I'll outline in this email how I think we should
> rework it. Including a prototype for the first, most drastic, steps.
> 
> The primary problems with the current design that I want to address are:
> 
> 1) Executor structure makes it hard to do asynchronous execution.  That
>    comes into play in several cases. For example we want to be able to
>    continue execution if no tuples are ready in one FDW below an append
>    node, instead switch to another one. Similarly that could be relevant
>    for parallel queries, with multiple gather nodes.
> 
>    A bit further out that'd also be very useful for continuing execution
>    in a different part of the tree once blocked on IO. That'd e.g. be
>    useful to build several hashtables for hashjoins at once.  There's
>    some limitations of how we do IO for that atm however, it'd be easier
>    to efficiently implement this if we had AIO support.
> 
>    With the current, tree-walking based architecture, it's hard to
>    implement something like that, because we basically have to re-enter
>    into the subtrees where we'd stopped. That'd requires adding new
>    state-machine logic in every tree, including additional branches.
> 
>    What we basically need here is the ability to stop execution inside a
>    specific executor node and have a multiplexer node (which can use
>    WaitEventSets or such) decide which part of the execution tree to
>    continue executing instead.

Yeah. Finally async execution requires that capability.

> 2) The tree-walking based approach makes it very hard to JIT compile the
>    main portion of query execution without manually re-emitting all of
>    executor/, which is obviously entirely unattractive.
> 
>    For the expression evaluation JIT the biggest benefit comes from
>    removing the indirect branches between the expression steps and then,
>    a close second, from having more efficient implementations of
>    hot-path expression steps.
> 
>    That approach currently isn't feasible for the whole executor. The
>    main sources for indirect branches / calls are ExecProcNode() and
>    ExecEvalExpr() calls inside the the Exec*() nodes. To be able to make
>    those constant we'd basically have to manually emit code for all of
>    these (or attempt to rely on the inliner, but that doesn't work
>    across deep, potentially self invoking, trees).  The expression code,
>    which also used to be tree walking, avoids that now by having the
>    indirect function calls nearly exclusively at the top-level.

I looked into this a bit. One of the hardest points in flattening the
executor (in my previous attempt) *was* turning the Exec* functions
inside out, especially Agg/WindowAgg. The core of the difficulty was
that the result could be utterly unreadable if I reconstructed, say,
WindowAgg or non-hash Agg into functions that are called with tuples
from the underlying nodes. In the repo, nodeAgg is flattened only for
the hash-agg case, so it looks rather straightforward, but I'm afraid
that such complex nodes become collections of cryptic fragments that
only work when called in the sequence emitted by the compiler part,
like a magic spell. (Sorry in advance if this is hard to read.)

One related concern is that, in the end, these "cryptic fragments" are
no longer node-private but global functions called from
ExecExecProgram. JIT'ing is essentially a kind of that, and in the
current ExecExpr case it seems tolerably small to me. However, I'm not
sure it is maintainable (for ordinary developers) if all such fragments
are connected via ExecExecProgram, following such a magic spell.

> Secondary issues that I also want to address, or at least get ready to
> address, are:
> 
> 
> 3) Reduce the overhead of executor startup. In a number of workloads
>    involving simple queries we currently spend most of the time within
>    ExecInitNode().  There's two major problems here: For one a lot of
>    work is done at executor initialization that should be part of
>    planning (especially creating tupledescs and nodeAgg
>    initialization). The other problem is that initialization does a lot
>    of tiny allocations.
> 
> 
> 4) Make the executor faster, leaving JIT aside.
> 
>    Moving to a "push based" executor can yield noticable speedups: Right
>    now returning a single tuple will often pass through a lot of
>    indirect function calls (via ExecProcNode()).  By moving to a push
>    based executor (where the first executed piece is reachable without
>    any recursion), this can be significantly cheaper.  We also exercise
>    the stack *a lot* right now, constantly pushing/poping stuff onto it
>    that's not going to be used anytime soon.  Profiling shows that stack
>    usage is quite a bottleneck right now.
> 
> 
> 5) Memory overhead of query execution.
> 
>    Right now executing a prepared statment has quite a massive memory
>    overhead. We copy around a lot of memory, initialize a lot of nodes
>    from scratch, etc. It'd be far better if we could share a lot of
>    state between successive executions of the same prepared statement.
> 
> 
> 6) Plan ahead for vectorized / batched execution.
> 
>    Dealing with tuples one-by-one is inefficient from a cache locality,
>    code locality and CPU pipelining perspective. Instead fully
>    processing a single tuple from a page (evaluating qual, then joining
>    it to other things etc), it's a lot more efficient to evaluate the
>    qual for all the tuples on a page, and then pass all the matching
>    tuples to the next execution step.
> 
> 
> After playing with / pondering about various approaches, I think the
> right way to approach this is to convert the tree-walk based execution
> with a 'program' / bytecode interpreter based approach. Quite similar to
> what we've done to expression evaluation in v10 (which then subsequently
> was used to implement JIT compilation of expression evaluation in v11).

I agree with the direction overall, but I have a concern about
maintainability, as above.

> To address the memory density / startup overhead issue I think we need
> to work towards two things: First, most of the memory needed for a query
> should come from a single, accurately sized, allocation. Secondly, the
> writable memory for query execution should be referenced using memory
> offsets, rather than pointers. That way we can clone the writable memory
> for query execution.
> 
> 
> That's obviously quite a bit of a project.  Or rather projects.
> 
> 
> I've started prototyping a solution for this.  The code for the
> prototype is at
>
https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=tree;h=refs/heads/executor-rewrite;hb=refs/heads/executor-rewrite
> (click on the repo to see the git url)
> and I plan to continue working on that project in that git branch.
> 
> 
> In my first iteration I'd tried to basically address 1-4 in a single
> protytype, using a single pass over the plan tree, without using the
> current executor initialization. Not impossible, but very very unwieldy.
> 
> In the second version of the prototype I'm instead only tackling 1-2,
> and reuse the current executor node initialization functions. This
> prototype adds support for generating and executing an opcode based
> execution plan for a subset of query nodes (seqscan, index [only] scan,
> hash / sort based aggregation without grouping sets, limit, inner
> nestloop / hashjoin without batches) when the use_linearized_plan is set
> and all nodes in the query are supported.
> 
> For example, this converts TPCH's Q01:
> 
> tpch_1[26142][1]=# SET client_min_messages ='log';
> tpch_1[26142][1]=# SELECT
>     l_returnflag,
>     l_linestatus,
>     sum(l_quantity) AS sum_qty,
>     sum(l_extendedprice) AS sum_base_price,
>     sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
>     sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
>     avg(l_quantity) AS avg_qty,
>     avg(l_extendedprice) AS avg_price,
>     avg(l_discount) AS avg_disc,
>     count(*) AS count_order
> FROM
>     lineitem
> WHERE
>     l_shipdate <= date '1998-12-01' - interval '74 days'
> GROUP BY
>     l_returnflag,
>     l_linestatus
> ORDER BY
>     l_returnflag,
>     l_linestatus;
> LOG:  00000: pp:
> 
> into:
> 
> 0: init_sort
> 1: seqscan_first
> 2: seqscan [j empty 5] > s0
> 3: qual [j fail 2] < scan s0
> 4: hashagg_tuple [j 2] < s0
> 5: drain_hashagg [j empty 7] > s1
> 6: sort_tuple [j 5] < s1
> 7: sort
> 8: drain_sort [j empty 10] > s2
> 9: return < s2 [next 8]
> 10: done
> 
> where sN are numbered slots and j are (potentially conditional) jumps.
> This works for a number of plans, but there are several where we
> currently bail out.
> 
> The basic idea is to convert the current, largely unmodified, Plan tree
> into a "linear" representation of steps that can be executed by an
> interpreter similar to ExecInterpExpr().  Individual parts of a query
> tree form "instructions" which together form a small program to execute
> the query.  Obviously such a program will contain (conditional) backward
> and forward jumps (which differentiates it from the expression case,
> which knows only forward jumps).
> 
> This means that some of the current executor nodes get split into
> multiple parts, because it's not permitted to recurse from within an
> executor step.
> 
> A tuple can be returned to the user / caller by a special 'return tuple'
> step. This just sets the "virtual" instruction pointer to the point
> where execution should resume, allowing execution to continue at the
> right spot when the next tuple is supposed to be returned (e.g. for
> cursors). Instead of, as currently in the prototype, returning control
> flow back to ExecutePlan() we could - and quite possibly should -
> instead just process the tuple by having a 'send_slot' step or such.
> But right now it's a good example of how control flow can be interrupted
> at any boundary.
> 
> For the async FDW (and gather, and IO, ...) cases, you'd have a
> 'scan_fdw' step that'd normally continue execution, but if there's no
> tuple, it'd instead jump to a multiplexer step that'd use a WaitEventSet
> to watch which of the to-be-waited-for FDWs is ready.

So eventually the executor will become push-based, where execution is
driven not only by FDW nodes but by all "source" nodes, including
local scans, I suppose.

> Despite being entirely unoptimized this already yields a measurable
> speedup over the current executor for the TPCH queries it supports.
> 
> 
> I think the big argument against this kind of project is that it's going
> to be quite invasive. Not everything inside the executor will have to be
> touched, but it's going to be a significant portion. But I don't think
> there's really a way around that for the medium to long term future, and
> I don't see how we'd gain anything by waiting further.
> 
> 
> My (although I certainly hope/plan to not work on this alone) basic plan
> to attack this is:
> 
> 1) Expand prototype to cover all SELECT queries. That's mostly "just
>    work". This'll reuse the current executor initialization with minimal
>    modifications.
> 
> 2) Reimplement EXPLAIN ANALYZE support. I'm not yet 100% sure what the
>    best way to do so is - potentially it's just generating slightly
>    different programs, where the steps contain additional code to
>    perform the instrumentation.
> 
> 3) Reimplement EvalPlanQual() infrastructure by generating a separate
>    execution plan for EPQ, which replaces the relevant scan nodes with
>    an execution step that returns tuples from estate->es_epqTuple.  The
>    current indirection / branches due to ExecScan() are fairly
>    expensive.
> 
> 4) Remove old executor *Exec() functions, and subsidiary code.
> 
> 5) Clean up executor nodes, they should now require far less
>    information, because a lot of the state will be in executor steps.
> 
> 6) Introduce a new pass, run before ExecInitNode(), that accounts for
>    all the necessary memory for, at least, nodes, slots, expression
>    contexts.  Then convert ExecInitNode() to allocate from that.  I'd
>    assume that we'd leave expression evaluation out of that, because
>    such work would be a largely independent project.
> 
> 7) Split state in the executor nodes / executor program steps into
>    modifiable and non-modifiable parts, and store as much as reasonably
>    possible using relative addresses rather than absolute pointers.
>    That can then be used to make executor programs reusable and
>    reentrant.  That's going to be useful for at least prepared
>    statements, plpgsql and better JIT code generation.
> 
> 8) Write a JIT implementation.
> 
> 
> Thoughts?
> 
> 
> Regards,
> 
> Andres

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center