Re: [PoC] Asynchronous execution again (which is not parallel) - Mailing list pgsql-hackers

From: Kyotaro HORIGUCHI
Subject: Re: [PoC] Asynchronous execution again (which is not parallel)
Date:
Msg-id: 20160121.182649.90876965.horiguchi.kyotaro@lab.ntt.co.jp
In response to: Re: [PoC] Asynchronous execution again (which is not parallel)  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses: Re: [PoC] Asynchronous execution again (which is not parallel)  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
           Re: [PoC] Asynchronous execution again (which is not parallel)  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hello,

I have given some consideration, and done some experimentation, on
callbacks as a means of async (early) execution.

> > Suppose we equip each EState with the ability to fire "callbacks".
> > Callbacks have the signature:
> > 
> > typedef bool (*ExecCallback)(PlanState *planstate, TupleTableSlot
> > *slot, void *context);
> > 
> > Executor nodes can register immediate callbacks to be run at the
> > earliest possible opportunity using a function like
> > ExecRegisterCallback(estate, callback, planstate, slot, context).
> > They can register deferred callbacks that will be called when a file
> > descriptor becomes ready for I/O, or when the process latch is set,
> > using a call like ExecRegisterFileCallback(estate, fd, event,
> > callback, planstate, slot, context) or
> > ExecRegisterLatchCallback(estate, callback, planstate, slot, context).

I have considered this. The immediate callbacks seem fine, but using a
latch or file descriptors to signal tuple availability doesn't seem to
fit callbacks stored in the EState. Such events can be deferred until
the parent requests a tuple and handled at that point, as ExecGather
does now. However, some kind of synchronization/waiting mechanism such
as a latch or select() is needed anyway.
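
To illustrate, the kind of wait I have in mind could look like the
sketch below. This is only a sketch, not part of the attached patches:
PendingCallback and WaitForDeferredCallback() are made-up names, while
WaitLatchOrSocket() and ResetLatch() are the existing latch API (whose
exact signature differs slightly across versions).

#include "postgres.h"
#include "miscadmin.h"          /* MyLatch */
#include "storage/latch.h"      /* WaitLatchOrSocket(), ResetLatch() */
#include "executor/executor.h"

/* callback signature as proposed upthread */
typedef bool (*ExecCallback) (PlanState *planstate, TupleTableSlot *slot,
                              void *context);

/* hypothetical bookkeeping for one deferred callback */
typedef struct PendingCallback
{
    ExecCallback    callback;
    PlanState      *planstate;
    TupleTableSlot *slot;
    void           *context;
    pgsocket        sock;       /* socket whose readability we wait for */
} PendingCallback;

/*
 * Block until the socket becomes readable or the process latch is set,
 * then run the deferred callback.
 */
static void
WaitForDeferredCallback(PendingCallback *cb)
{
    int         rc;

    rc = WaitLatchOrSocket(MyLatch,
                           WL_LATCH_SET | WL_SOCKET_READABLE,
                           cb->sock,
                           -1L);

    if (rc & WL_LATCH_SET)
        ResetLatch(MyLatch);

    (void) cb->callback(cb->planstate, cb->slot, cb->context);
}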

> > To execute callbacks, an executor node can call ExecFireCallbacks(),
> > which will fire immediate callbacks in order of registration, and wait
> > for the file descriptors for which callbacks have been registered and
> > for the process latch when no immediate callbacks remain but there are
> > still deferred callbacks.  It will return when (1) there are no
> > remaining immediate or deferred callbacks or (2) one of the callbacks
> > returns "true".
> 
> Excellent! I unconsciously excluded the case of callbacks because
> I supposed (without solid grounds) that all executor nodes could have
> a chance to win from this. Such a callback is a good choice for doing
> what Start*Node did in the latest patch.

The previous patch added a large amount of clutter: the async-execution
machinery included additional code for an ExecStartNode phase, built in
the same manner as ExecProcNode and ExecEndNode. Most of that
additional code is useless for most node types.

A callback is suitable for uncommon, invoked-once-per-event operations
such as error handling. In this case, the operation is the asynchronous
execution of a node, and the event is "just before ExecProcNode on the
topmost node". The first attached patch lets async-capable nodes
register callbacks during the Init phase and executes them just before
the Exec phase of the topmost node. As a result, it greatly reduces the
amount of additional code. This is what the word "callbacks" first
brought to my mind.
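
Roughly, the mechanism has the shape sketched below, using the
ExecCallback signature quoted above. The function names and the
es_async_callbacks field added to EState are only illustrative, not
necessarily what the attached patch actually uses.

typedef struct ExecAsyncCallbackItem
{
    ExecCallback    callback;
    PlanState      *planstate;
    void           *context;
} ExecAsyncCallbackItem;

/* called from an async-capable node's ExecInit* function */
static void
RegisterAsyncCallback(EState *estate, ExecCallback callback,
                      PlanState *planstate, void *context)
{
    ExecAsyncCallbackItem *item = palloc(sizeof(ExecAsyncCallbackItem));

    item->callback = callback;
    item->planstate = planstate;
    item->context = context;

    /* es_async_callbacks would be a new List field in EState */
    estate->es_async_callbacks = lappend(estate->es_async_callbacks, item);
}

/* called once, just before ExecProcNode() on the topmost plan node */
static void
FireAsyncCallbacks(EState *estate)
{
    ListCell   *lc;

    foreach(lc, estate->es_async_callbacks)
    {
        ExecAsyncCallbackItem *item = (ExecAsyncCallbackItem *) lfirst(lc);

        (void) item->callback(item->planstate, NULL, item->context);
    }
}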

With this patch, the following operations yield LOG messages from the
dummy callback.

CREATE TABLE t1 (a int, b int);
INSERT INTO t1 (SELECT a, 1 FROM generate_series(0, 99) a);
CREATE TABLE t2 (a int, b int);
INSERT INTO t2 (SELECT a, 2 FROM generate_series(0, 99) a);
CREATE TABLE t3 (a int, b int);
INSERT INTO t3 (SELECT a, 3 FROM generate_series(0, 99) a);
SELECT * FROM t1 UNION ALL SELECT * FROM t2 UNION ALL SELECT * FROM t3;
===
LOG:  dummy_async_cb is called for 0x2783a98
LOG:  dummy_async_cb is called for 0x2784248
LOG:  dummy_async_cb is called for 0x2784ad0

What my previous patch could do is also doable with this first patch,
with a far smaller diff.

If this design is not too bad, I'll work on the postgres_fdw part.


Next is the discussion of asynchronous tuple fetching.

> > Then, suppose we add a function bool ExecStartAsync(PlanState *target,
> > ExecCallback callback, PlanState *cb_planstate, void *cb_context).
> > For non-async-aware plan nodes, this just returns false.  async-aware
> > plan nodes should initiate some work, register some callbacks, and
> > return.  The callback that gets registered should arrange in turn to
> > register the callback passed as an argument when a tuple becomes
> > available, passing the planstate and context provided by
> > ExecStartAsync's caller, plus the TupleTableSlot containing the tuple.
> 
> Although I can't clearly imagine the case of async-aware nodes under
> non-aware nodes, it seems to have a high affinity with a (true)
> parallel execution framework.

ExecStartAsync is similar to ExecStartNode in my old patch. One of the
most annoying things about that approach is that it has to walk down to
the descendants, which in turn requires cluttered corresponding
additional code for every node type that can have children.

Instead, in the second patch, I modified ExecProcNode to report an
async status through the EState. It is EXEC_READY or EXEC_EOT (end of
table / no more tuples) for non-async-capable nodes, and async-capable
nodes can set it to EXEC_NOT_READY, which indicates that there could be
more tuples but none is available yet.
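
For clarity, the status could be represented as below. The names follow
the description above; where exactly the field lives in EState is up to
the patch.

/* possible values of a new es_async_status field in EState */
typedef enum ExecAsyncStatus
{
    EXEC_READY,                 /* a tuple was returned in the slot */
    EXEC_EOT,                   /* end of table: no more tuples will come */
    EXEC_NOT_READY              /* more tuples may come, none available yet */
} ExecAsyncStatus;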

Async-aware nodes such as Append can move on to the next child if the
current child returned EXEC_NOT_READY. If all non-EXEC_EOT children
returned EXEC_NOT_READY, Append should wait using some signaling
mechanism (for now it simply busy-loops instead). As an example, the
second patch modifies ExecAppend to handle this and modifies ExecSeqScan
to return EXEC_NOT_READY with a certain probability, as an emulation of
asynchronous tuple fetching. The UNION ALL query above then returns
rows interleaved among the three tables.
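
In simplified form, the Append behaviour is sketched below. This is not
the patch itself: es_async_status is the hypothetical EState field from
the enum above, and the busy-looping mirrors what the patch currently
does where a proper wait should go.

static TupleTableSlot *
ExecAppendSketch(AppendState *node)
{
    EState     *estate = node->ps.state;

    for (;;)
    {
        bool        all_eot = true;
        int         i;

        /* poll every child once, in order */
        for (i = 0; i < node->as_nplans; i++)
        {
            TupleTableSlot *slot = ExecProcNode(node->appendplans[i]);

            if (estate->es_async_status == EXEC_READY)
                return slot;        /* got a tuple from this child */
            if (estate->es_async_status == EXEC_NOT_READY)
                all_eot = false;    /* child not finished, just not ready */
            /* EXEC_EOT: this child has no more tuples */
        }

        if (all_eot)
            return NULL;            /* every child reached end of table */

        /*
         * All unfinished children were NOT_READY.  The patch busy-loops
         * here for now; a latch/select()-based wait should go here.
         */
        CHECK_FOR_INTERRUPTS();
    }
}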

> > So, in response to ExecStartAsync, if there's no tuple currently
> > available, postgres_fdw can send a query to the remote server and
> > request a callback when the fd becomes read-ready.  It must save the
> > callback passed to ExecStartAsync inside the PlanState someplace so
> > that when a tuple becomes available it can register that callback.
> > 
> > ExecAppend can call ExecStartAsync on each of its subplans.  For any
> > subplan where ExecStartAsync returns false, ExecAppend will just
> > execute it normally, by calling ExecProcNode repeatedly until no more
> > tuples are returned.  But for async-capable subplans, it can call
> > ExecStartAsync on all of them, and then call ExecFireCallbacks.  The
> > tuple-ready callback it passes to its child plans will take the tuple
> > provided by the child plan and store it into the Append node's slot.
> > It will then return true if, and only if, ExecFireCallbacks is being
> > invoked from ExecAppend (which it can figure out via some kind of
> > signalling either through its own PlanState or centralized signalling
> > through the EState).  That way, if ExecAppend were itself invoked
> > asynchronously, its tuple-ready callback could simply populate a slot
> > appropriately and register its invoker's tuple-ready callback.  Whether
> > called synchronously or asynchronously, each invocation of an
> > asynchronous append after the first would just need to again call
> > ExecStartAsync on the child that last returned a tuple.
> 
> Thanks for the attentive explanation. My concern about this is the
> latency caused by synchronizing the producer and the consumer for
> every tuple. My previous patch is not asynchronous on every tuple, so
> it can give a pure gain without the loss from tuple-wise
> synchronization. But this looks clean and I like it, so I'll consider
> it.
> 
> > It seems pretty straightforward to fit Gather into this infrastructure.
>
> Yes.

If Gather's children become a regular node struct, with a name like
Worker(Node), instead of the non-Node structure used now, we can
generalize the tuple-synchronization mechanism so that it can also be
used by other nodes such as ForeignScan. Append(ForeignScan,
ForeignScan, ...) with async tuple passing can spread the load across
multiple foreign servers, so I suppose it is preferable as long as
there is no penalty.

> > It is unclear to me how useful this is beyond ForeignScan, Gather, and
> > Append.  MergeAppend's ordering constraint makes it less useful; we
> > can asynchronously kick off the request for the next tuple before
> > returning the previous one, but we're going to need to have that tuple
> > before we can return the next one.  But it could be done.  It could
> > potentially even be applied to seq scans or index scans using some set
> > of asynchronous I/O interfaces, but I don't see how it could be
> > applied to joins or aggregates, which typically can't really proceed
> > until they get the next tuple.  They could be plugged into this
> > interface easily enough but it would only help to the extent that it
> > enabled asynchrony elsewhere in the plan tree to be pulled up towards
> > the root.
> 
> This is mainly an argument not about "asynchronous execution/start"
> but about "asynchronous tuple-passing". As I showed before, a merge
> join on asynchronous and parallel children running sorts *can* win
> over a hash join (if the planner foresees that). Where asynchronous
> tuple-passing is not so effective, as with MergeAppend, we can simply
> refrain from doing it. But cost modeling for it is a difficult
> problem.
> 
> > Thoughts?
> 
> I'll try the callback framework and in-process asynchronous
> tuple-passing (like select(2)). Please wait for a while.

Finally, the two attached patches turned out somewhat different from
Robert's suggestion and lack the synchronization feature. However, if
this direction is not too bad, I'll build that feature on top of it.

Suggestions? Thoughts?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
