Re: asynchronous and vectorized execution - Mailing list pgsql-hackers

From: Amit Khandekar
Subject: Re: asynchronous and vectorized execution
Date:
Msg-id: CAJ3gD9cN6dj8+jYES0M7MZ1pZdj4x6L5NVgw1aUwoShpJ1WhTw@mail.gmail.com
In response to: asynchronous and vectorized execution (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: asynchronous and vectorized execution
List: pgsql-hackers
We may also want to consider handling abstract events such as "tuples-are-available-at-plan-node-X".

One benefit is that we can combine this with batch processing. For example, in the case of an Append node containing foreign scans, its parent node may not want to process the Append node's result until the Append is ready with at least 1000 rows. In that case, the Append node needs to raise an "n-tuples-are-ready" event; we cannot rely on fd-ready events alone. An fd-ready event will wake up the foreign scan, but it will not necessarily cause the scan's parent Append node to in turn wake up its own parent.
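Just to make the idea concrete, here is a rough sketch (nothing below exists in the prototype; the threshold, the as_nbuffered field, and both helper functions are invented names):

/*
 * Sketch only: when one of its async foreign scans becomes ready, the
 * Append buffers whatever that child has, and raises an
 * "n-tuples-are-ready" event for its own parent only once enough
 * tuples have accumulated.
 */
#define APPEND_READY_THRESHOLD 1000

static void
ExecAppendChildReady(AppendState *node, PlanState *child)
{
    /* drain the ready child into a local buffer (hypothetical helper) */
    node->as_nbuffered += ExecAppendCollectTuples(node, child);

    /* only now wake the parent, with the tuple count as the payload */
    if (node->as_nbuffered >= APPEND_READY_THRESHOLD)
        ExecRaiseTuplesReadyEvent((PlanState *) node, node->as_nbuffered);
}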

The other benefit (I am not sure how significant it is) is the "at-plan-node-X" part of the event. For example, for an Append node having 10 foreign scans, when a foreign scan wakes up and becomes ready with tuple(s), its parent node (i.e. the Append) will be executed. But things would be faster if the Append knew which of its foreign scan nodes are ready. From Robert's prototype, I can see that it can find that out by checking the result_ready field of each foreign scan's plan state; but if it knows from the event that node X is the one that is ready, it can take tuples directly from there. Another thing is that we may want to give the Append node a chance to learn about all the nodes that are ready, instead of just one node.

How we can do this event abstraction is the other question. We could have one latch for each event, with each node raising its own event by setting the corresponding latch; but I am not sure about using latches within a single process, as opposed to one process waking up another. Otherwise, some internal event structures would need to be present (in the EState?), which ExecProcNode would then use when it loops over events to wake up (i.e. execute) the required nodes.

Also, the WaitEvent.user_data field can carry some more information besides the plan state. It can have the node's parent PlanState stored, so that we don't have to add a parent field to the plan state. It can also carry some more data, such as "n-tuples-available".
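As a sketch (only WaitEvent.user_data itself exists today; the struct and its fields below are invented for illustration):

/*
 * Hypothetical payload that WaitEvent.user_data could point to for an
 * executor wait event.
 */
typedef struct ExecAsyncEventData
{
    PlanState  *node;               /* node that registered the wait */
    PlanState  *parent;             /* avoids adding a parent field to PlanState */
    int         n_tuples_available; /* payload for "n-tuples-are-ready" */
} ExecAsyncEventData;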


On 9 May 2016 at 23:03, Robert Haas <robertmhaas@gmail.com> wrote:
Hi,

I realize that we haven't gotten 9.6beta1 out the door yet, but I
think we can't really wait much longer to start having at least some
discussion of 9.7 topics, so I'm going to go ahead and put this one
out there.  I believe there are other people thinking about these
topics as well, including Andres Freund, Kyotaro Horiguchi, and
probably some folks at 2ndQuadrant (but I don't know exactly who).  To
make a long story short, I think there are several different areas
where we should consider major upgrades to our executor.  It's too
slow and it doesn't do everything we want it to do.  The main things
on my mind are:

1. asynchronous execution, by which I mean the ability of a node to
somehow say that it will generate a tuple eventually, but is not yet
ready, so that the executor can go run some other part of the plan
tree while it waits.  This case most obviously arises for foreign
tables, where it makes little sense to block on I/O if some other part
of the query tree could benefit from the CPU; consider SELECT * FROM
lt WHERE qual UNION SELECT * FROM ft WHERE qual.  It is also a problem
for parallel query: in a parallel sequential scan, the next worker can
begin reading the next block even if the current block hasn't yet been
received from the OS.  Whether or not this will be efficient is a
research question, but it can be done.  However, imagine a parallel
scan of a btree index: we don't know what page to scan next until we
read the previous page and examine the next-pointer.  In the meantime,
any worker that arrives at that scan node has no choice but to block.
It would be better if the scan node could instead say "hey, thanks for
coming but I'm really not ready to be on-CPU just at the moment" and
potentially allow the worker to go work in some other part of the
query tree.  For that worker to actually find useful work to do
elsewhere, we'll probably need either the table to be partitioned or
the original query to involve UNION ALL,
but those are not silly cases to worry about, particularly if we get
native partitioning in 9.7.

2. vectorized execution, by which I mean the ability of a node to
return tuples in batches rather than one by one.  Andres has opined
more than once that repeated trips through ExecProcNode defeat the
ability of the CPU to do branch prediction correctly, slowing the
whole system down, and that they also result in poor CPU cache
behavior, since we jump all over the place executing a little bit of
code from each node before moving onto the next rather than running
one bit of code first, and then another later.  I think that's
probably right.   For example, consider a 5-table join where all of
the joins are implemented as hash tables.  If this query plan is going
to be run to completion, it would make much more sense to fetch, say,
100 tuples from the driving scan and then probe for all of those in
the first hash table, and then probe for all of those in the second
hash table, and so on.  What we do instead is fetch one tuple and
probe for it in all 5 hash tables, and then repeat.  If one of those
hash tables would fit in the CPU cache but all five together will not,
that seems likely to be a lot worse.   But even just ignoring the CPU
cache aspect of it for a minute, suppose you want to write a loop to
perform a hash join.  The inner loop fetches the next tuple from the
probe table and does a hash lookup.  Right now, fetching the next
tuple from the probe table means calling a function which in turn
calls another function which probably calls another function which
probably calls another function and now about 4 layers down we
actually get the next tuple.  If the scan returned a batch of tuples
to the hash join, fetching the next tuple from the batch would
probably be 0 or 1 function calls rather than ... more.  Admittedly,
you've got to consider the cost of marshaling the batches but I'm
optimistic that there are cycles to be squeezed out here.  We might
also want to consider storing batches of tuples in a column-optimized
rather than row-optimized format so that iterating through one or two
attributes across every tuple in the batch touches the minimal number
of cache lines.
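Here is roughly the shape of the loop I have in mind, purely as
illustrative pseudocode (TupleBatch and all of the helper functions
below are invented; only HashJoinTable and PlanState are real types):

/*
 * Illustrative pseudocode: probe each hash table with a whole batch of
 * tuples before moving on to the next hash table, so that each table's
 * buckets stay hot in the CPU cache.
 */
#define BATCH_SIZE 100

static void
run_joins_batchwise(PlanState *drivingscan, HashJoinTable hashtables[], int njoins)
{
    TupleBatch *batch;
    int         i, j;

    while ((batch = FetchTupleBatch(drivingscan, BATCH_SIZE)) != NULL)
    {
        for (i = 0; i < njoins; i++)
        {
            for (j = 0; j < batch->ntuples; j++)
                ProbeHashTable(hashtables[i], batch->tuples[j]);
        }
        EmitJoinedTuples(batch);
    }
}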

Obviously, both of these are big projects that could touch a large
amount of executor code, and there may be other ideas, in addition to
these, which some of you may be thinking about that could also touch a
large amount of executor code.  It would be nice to agree on a way
forward that minimizes code churn and maximizes everyone's attempt to
contribute without conflicting with each other.  Also, it seems
desirable to enable, as far as possible, incremental development - in
particular, it seems to me that it would be good to pick a design that
doesn't require massive changes to every node all at once.  A single
patch that adds some capability to every node in the executor in one
fell swoop is going to be too large to review effectively.

My proposal for how to do this is to make ExecProcNode function as a
backward-compatibility wrapper.  For asynchronous execution, a node
might return a not-ready-yet indication, but if that node is called
via ExecProcNode, it means the caller isn't prepared to receive such
an indication, so ExecProcNode will just wait for the node to become
ready and then return the tuple.  Similarly, for vectorized execution,
a node might return a bunch of tuples all at once.  ExecProcNode will
extract the first one and return it to the caller, and subsequent
calls to ExecProcNode will iterate through the rest of the batch, only
calling the underlying node-specific function when the batch is
exhausted.  In this way, code that doesn't know about the new stuff
can continue to work pretty much as it does today.  Also, and I think
this is important, nodes don't need the permission of their parent
node to use these new capabilities.  They can use them whenever they
wish, without worrying about whether the upper node is prepared to
deal with it.  If not, ExecProcNode will paper over the problem.  This
seems to me to be a good way to keep the code simple.
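In rough outline, the wrapper could look about like this (a sketch,
not the actual patch: TupleBatch, the cur_batch fields, IsTupleBatch(),
ExecDispatchNode(), and ExecEventLoop() are invented names; result and
result_ready are the new PlanState members described below):

TupleTableSlot *
ExecProcNode(PlanState *node)
{
    /* If an earlier call stashed a batch, hand out its next tuple. */
    if (node->cur_batch != NULL &&
        node->cur_batch_pos < node->cur_batch->ntuples)
        return node->cur_batch->slots[node->cur_batch_pos++];

    node->result_ready = false;
    ExecDispatchNode(node);         /* the node-type-specific ExecBlah() */

    /*
     * The node engaged asynchronous behavior, but our caller can't cope
     * with a not-ready-yet indication, so run the event loop (sketched
     * further below) until this node has a result.
     */
    while (!node->result_ready)
        ExecEventLoop(node->state, node);

    if (IsTupleBatch(node->result))
    {
        /* Vectorized node: stash the batch and return its first tuple. */
        node->cur_batch = (TupleBatch *) node->result;
        node->cur_batch_pos = 0;
        return node->cur_batch->slots[node->cur_batch_pos++];
    }

    return (TupleTableSlot *) node->result;    /* ordinary single tuple */
}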

For asynchronous execution, I have gone so far as to mock up a bit of
what this might look like.  This shouldn't be taken very seriously at
this point, but I'm attaching a few very-much-WIP patches to show the
direction of my line of thinking.  Basically, I propose to have
ExecBlah (that is, ExecBitmapHeapScan, ExecAppend, etc.) return tuples
by putting them into a new PlanState member called "result", which is
just a Node * so that we can support multiple types of results,
instead of returning them.  There is also a result_ready Boolean; a
node can return without setting it in order to engage asynchronous
behavior.  This triggers an "event loop", which
repeatedly waits for FDs chosen by waiting nodes to become readable
and/or writeable and then gives the node a chance to react.
Eventually, the waiting node will stop waiting and have a result
ready, at which point the event loop will give the parent of that node
a chance to run.  If that node consequently becomes ready, then its
parent gets a chance to run.  Eventually (we hope), the node for which
we're waiting becomes ready, and we can then read a result tuple.
With some more work, this seems like it can handle the FDW case, but I
haven't worked out how to make it handle the related parallel query
case.  What we want there is to wait not for the readiness of an FD
but rather for some other process involved in the parallel query to
reach a point where it can welcome assistance executing that node.  I
don't know exactly what the signaling for that should look like yet -
maybe setting the process latch or something.
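To give a flavor of it, the loop might look something like this
(sketch only: WaitEventSetWait() and WaitEvent.user_data are the real
9.6 latch APIs, while es_wait_event_set, the parent link,
ExecAsyncNotify(), and result_ready are from the WIP patches or
invented here for illustration):

static void
ExecEventLoop(EState *estate, PlanState *waiting_for)
{
    while (!waiting_for->result_ready)
    {
        WaitEvent   occurred;
        PlanState  *ready;

        /* Sleep until some waiting node's fd becomes readable/writable. */
        WaitEventSetWait(estate->es_wait_event_set, -1, &occurred, 1);

        /* Give the node that registered the fd a chance to react. */
        ready = (PlanState *) occurred.user_data;
        ExecAsyncNotify(ready);

        /* If it now has a result, let its parent run, and so on upward. */
        while (ready->result_ready && ready->parent != NULL)
        {
            ready = ready->parent;
            ExecAsyncNotify(ready);
        }
    }
}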

By the way, one smaller executor project that I think we should also
look at has to do with this comment in nodeSeqScan.c:

static bool
SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
{
        /*
         * Note that unlike IndexScan, SeqScan never use keys in heap_beginscan
         * (and this is very bad) - so, here we do not check are keys ok or not.
         */
        return true;
}

Some quick prototyping by my colleague Dilip Kumar suggests that, in
fact, there are cases where pushing down keys into heap_beginscan()
could be significantly faster.  Some care is required here because any
functions we execute as scan keys are run with the buffer locked, so
we had better not run anything very complicated.  But doing this for
simple things like integer equality operators seems like it could save
quite a few buffer lock/unlock cycles and some other executor overhead
as well.
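For example, a scan key for a simple integer-equality qual can already
be expressed with the existing machinery, roughly like this (a sketch
assuming the relation, snapshot, and attribute number are at hand as
in nodeSeqScan.c; ScanKeyInit(), F_INT4EQ, and heap_beginscan() are
existing APIs):

/*
 * Sketch: build a scan key for "col = 42" and pass it to
 * heap_beginscan(), so the test runs via HeapKeyTest() inside the heap
 * scan machinery rather than as an ExecQual() on every returned tuple.
 */
ScanKeyData key;

ScanKeyInit(&key,
            attno,                  /* attribute number of the int4 column */
            BTEqualStrategyNumber,  /* simple equality */
            F_INT4EQ,               /* int4eq() is cheap enough to run
                                     * with the buffer lock held */
            Int32GetDatum(42));

scandesc = heap_beginscan(currentRelation,
                          estate->es_snapshot,
                          1,        /* nkeys */
                          &key);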

Thoughts, ideas, suggestions, etc. very welcome.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



