Redesigning the executor (async, JIT, memory efficiency) - Mailing list pgsql-hackers

From Andres Freund
Subject Redesigning the executor (async, JIT, memory efficiency)
Msg-id 20180525033538.6ypfwcqcxce6zkjj@alap3.anarazel.de
Responses Re: Redesigning the executor (async, JIT, memory efficiency)  (Heikki Linnakangas <hlinnaka@iki.fi>)
Re: Redesigning the executor (async, JIT, memory efficiency)  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List pgsql-hackers
Hi,


The current executor structure limits us in a number of ways that are
becoming problematic.  In this email I'll outline how I think we should
rework it, including a prototype of the first, most drastic, steps.

The primary problems with the current design that I want to address are:

1) Executor structure makes it hard to do asynchronous execution.  That
   comes into play in several cases. For example, if no tuples are ready
   in one FDW below an append node, we want to be able to continue
   execution by switching to another one. Similarly, that could be
   relevant for parallel queries with multiple gather nodes.

   A bit further out that'd also be very useful for continuing execution
   in a different part of the tree once blocked on IO. That'd e.g. be
   useful to build several hashtables for hashjoins at once.  There are
   some limitations in how we do IO that make that hard at the moment;
   it'd be easier to implement this efficiently if we had AIO support.

   With the current, tree-walking based architecture, it's hard to
   implement something like that, because we basically have to re-enter
   into the subtrees where we'd stopped. That'd require adding new
   state-machine logic in every tree, including additional branches.

   What we basically need here is the ability to stop execution inside a
   specific executor node and have a multiplexer node (which can use
   WaitEventSets or such) decide which part of the execution tree to
   continue executing instead.


2) The tree-walking based approach makes it very hard to JIT compile the
   main portion of query execution without manually re-emitting all of
   executor/, which is obviously entirely unattractive.

   For the expression evaluation JIT the biggest benefit comes from
   removing the indirect branches between the expression steps and then,
   a close second, from having more efficient implementations of
   hot-path expression steps.

   That approach currently isn't feasible for the whole executor. The
   main sources for indirect branches / calls are ExecProcNode() and
   ExecEvalExpr() calls inside the Exec*() nodes. To be able to make
   those constant we'd basically have to manually emit code for all of
   these (or attempt to rely on the inliner, but that doesn't work
   across deep, potentially self invoking, trees).  The expression code,
   which also used to be tree walking, avoids that now by having the
   indirect function calls nearly exclusively at the top-level.


Secondary issues that I also want to address, or at least get ready to
address, are:


3) Reduce the overhead of executor startup. In a number of workloads
   involving simple queries we currently spend most of the time within
   ExecInitNode().  There are two major problems here: For one, a lot of
   work is done at executor initialization that should be part of
   planning (especially creating tupledescs and nodeAgg
   initialization). The other problem is that initialization does a lot
   of tiny allocations.


4) Make the executor faster, leaving JIT aside.

   Moving to a "push based" executor can yield noticeable speedups: Right
   now returning a single tuple will often pass through a lot of
   indirect function calls (via ExecProcNode()).  By moving to a push
   based executor (where the first executed piece is reachable without
   any recursion), this can be significantly cheaper.  We also exercise
   the stack *a lot* right now, constantly pushing/popping stuff onto it
   that's not going to be used anytime soon.  Profiling shows that stack
   usage is quite a bottleneck right now.


5) Memory overhead of query execution.

   Right now executing a prepared statement has quite a massive memory
   overhead. We copy around a lot of memory, initialize a lot of nodes
   from scratch, etc. It'd be far better if we could share a lot of
   state between successive executions of the same prepared statement.


6) Plan ahead for vectorized / batched execution.

   Dealing with tuples one-by-one is inefficient from a cache locality,
   code locality and CPU pipelining perspective. Instead of fully
   processing a single tuple from a page (evaluating the qual, then
   joining it to other things, etc.), it's a lot more efficient to
   evaluate the qual for all the tuples on a page, and then pass all
   the matching tuples to the next execution step.


After playing with / pondering about various approaches, I think the
right way to approach this is to replace the tree-walk based execution
with a 'program' / bytecode interpreter based approach. Quite similar to
what we've done to expression evaluation in v10 (which then subsequently
was used to implement JIT compilation of expression evaluation in v11).

To address the memory density / startup overhead issue I think we need
to work towards two things: First, most of the memory needed for a query
should come from a single, accurately sized, allocation. Secondly, the
writable memory for query execution should be referenced using memory
offsets, rather than pointers. That way we can clone the writable memory
for query execution.


That's obviously quite a bit of a project.  Or rather projects.


I've started prototyping a solution for this.  The code for the
prototype is at

https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=tree;h=refs/heads/executor-rewrite;hb=refs/heads/executor-rewrite
(click on the repo to see the git url)
and I plan to continue working on that project in that git branch.


In my first iteration I'd tried to basically address 1-4 in a single
prototype, using a single pass over the plan tree, without using the
current executor initialization. Not impossible, but very, very unwieldy.

In the second version of the prototype I'm instead only tackling 1-2,
and reuse the current executor node initialization functions. This
prototype adds support for generating and executing an opcode based
execution plan for a subset of query nodes (seqscan, index [only] scan,
hash / sort based aggregation without grouping sets, limit, inner
nestloop / hashjoin without batches) when use_linearized_plan is set
and all nodes in the query are supported.

For example, this converts TPCH's Q01:

tpch_1[26142][1]=# SET client_min_messages ='log';
tpch_1[26142][1]=# SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity) AS avg_qty,
    avg(l_extendedprice) AS avg_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM
    lineitem
WHERE
    l_shipdate <= date '1998-12-01' - interval '74 days'
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;
LOG:  00000: pp:

into:

0: init_sort
1: seqscan_first
2: seqscan [j empty 5] > s0
3: qual [j fail 2] < scan s0
4: hashagg_tuple [j 2] < s0
5: drain_hashagg [j empty 7] > s1
6: sort_tuple [j 5] < s1
7: sort
8: drain_sort [j empty 10] > s2
9: return < s2 [next 8]
10: done

where sN are numbered slots and j are (potentially conditional) jumps.
This works for a number of plans, but there are several where we
currently bail out.

The basic idea is to convert the current, largely unmodified, Plan tree
into a "linear" representation of steps that can be executed by an
interpreter similar to ExecInterpExpr().  Individual parts of a query
tree form "instructions" which together form a small program to execute
the query.  Obviously such a program will contain (conditional) backward
and forward jumps (which differentiates it from the expression case,
which knows only forward jumps).

This means that some of the current executor nodes get split into
multiple parts, because it's not permitted to recurse from within an
executor step.

A tuple can be returned to the user / caller by a special 'return tuple'
step. This just sets the "virtual" instruction pointer to the point
where execution should resume, allowing execution to continue at the
right spot when the next tuple is supposed to be returned (e.g. for
cursors). Instead of, as currently in the prototype, returning control
flow back to ExecutePlan() we could - and quite possibly should -
instead just process the tuple by having a 'send_slot' step or such.
But right now it's a good example of how control flow can be interrupted
at any boundary.

For the async FDW (and gather, and IO, ...) cases, you'd have a
'scan_fdw' step that'd normally continue execution, but if there's no
tuple, it'd instead jump to a multiplexer step that'd use a WaitEventSet
to watch which of the to-be-waited-for FDWs is ready.


Despite being entirely unoptimized this already yields a measurable
speedup over the current executor for the TPCH queries it supports.


I think the big argument against this kind of project is that it's going
to be quite invasive. Not everything inside the executor will have to be
touched, but it's going to be a significant portion. But I don't think
there's really a way around that for the medium to long term future, and
I don't see how we'd gain anything by waiting further.


My (although I certainly hope/plan to not work on this alone) basic plan
to attack this is:

1) Expand prototype to cover all SELECT queries. That's mostly "just
   work". This'll reuse the current executor initialization with minimal
   modifications.

2) Reimplement EXPLAIN ANALYZE support. I'm not yet 100% sure what the
   best way to do so is - potentially it's just generating slightly
   different programs, where the steps contain additional code to
   perform the instrumentation.

3) Reimplement EvalPlanQual() infrastructure by generating a separate
   execution plan for EPQ, which replaces the relevant scan nodes with
   an execution step that returns tuples from estate->es_epqTuple.  The
   current indirection / branches due to ExecScan() are fairly
   expensive.

4) Remove old executor *Exec() functions, and subsidiary code.

5) Clean up executor nodes, they should now require far less
   information, because a lot of the state will be in executor steps.

6) Introduce a new pass, run before ExecInitNode(), that accounts for
   all the necessary memory for, at least, nodes, slots, expression
   contexts.  Then convert ExecInitNode() to allocate from that.  I'd
   assume that we'd leave expression evaluation out of that, because
   such work would be a largely independent project.

7) Split state in the executor nodes / executor program steps into
   modifiable and non-modifiable parts, and store as much as reasonably
   possible using relative addresses rather than absolute pointers.
   That can then be used to make executor programs reusable and
   reentrant.  That's going to be useful for at least prepared
   statements, plpgsql, and better JIT code generation.

8) Write a JIT implementation.


Thoughts?


Regards,

Andres

