Re: [PoC] Asynchronous execution again (which is not parallel) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [PoC] Asynchronous execution again (which is not parallel)
Date
Msg-id CA+TgmoZq1dt795nwYCSuVeDAmLv-n6vD+VYKT36zWoHK+TSCdA@mail.gmail.com
In response to Re: [PoC] Asynchronous execution again (which is not parallel)  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List pgsql-hackers
On Thu, Jan 21, 2016 at 4:26 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I have given some consideration to, and done some experimenting with,
> callbacks as a means of async (early) execution.

Thanks for working on this.

>> > Suppose we equip each EState with the ability to fire "callbacks".
>> > Callbacks have the signature:
>> >
>> > typedef bool (*ExecCallback)(PlanState *planstate, TupleTableSlot
>> > *slot, void *context);
>> >
>> > Executor nodes can register immediate callbacks to be run at the
>> > earliest possible opportunity using a function like
>> > ExecRegisterCallback(estate, callback, planstate, slot, context).
>> > They can register deferred callbacks that will be called when a file
>> > descriptor becomes ready for I/O, or when the process latch is set,
>> > using a call like ExecRegisterFileCallback(estate, fd, event,
>> > callback, planstate, slot, context) or
>> > ExecRegisterLatchCallback(estate, callback, planstate, slot, context).
>
> I considered this. The immediate callbacks seem fine, but using a
> latch or fds to signal tuple availability doesn't seem to fit
> callbacks stored in the estate. Those events can be deferred until
> the parent's tuple request and handled at that point, as ExecGather
> does now. However, some kind of synchronization/waiting mechanism
> like a latch or select() is needed anyway.

I am not entirely sure I understand what you are trying to say here,
but if I do understand it then I disagree.  Consider an Append node
with 1000 children, each a ForeignScan.  What we want to do is fire
off all 1000 remote queries and then return a tuple from whichever
ForeignScan becomes ready first.  What we do NOT want to do is fire
off all 1000 remote queries and have to either (a) iterate through all
1000 subplans checking repeatedly whether each is ready or (b) pick
one of those 1000 subplans to wait for and ignore the other 999 until
the one we pick is ready.  I don't see any way to achieve what we need
here without some way to convince the executor to do a select() across
all 1000 fds and take some action based on which one becomes
read-ready.
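
To make that concrete, here is a minimal, purely illustrative sketch (not
executor code; wait_for_any_ready(), handle_ready(), and the fd array are
all assumptions) of waiting on many descriptors at once and dispatching to
whichever becomes read-ready first:

#include <sys/select.h>

extern void handle_ready(int which);   /* assumed: poke the i-th subplan */

/*
 * Block until at least one of the nfds descriptors is readable, then let
 * each readable one's owner (e.g. a ForeignScan) try to return a tuple.
 */
static void
wait_for_any_ready(const int *fds, int nfds)
{
    fd_set  readfds;
    int     maxfd = -1;
    int     i;

    FD_ZERO(&readfds);
    for (i = 0; i < nfds; i++)
    {
        FD_SET(fds[i], &readfds);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    if (select(maxfd + 1, &readfds, NULL, NULL, NULL) > 0)
    {
        for (i = 0; i < nfds; i++)
            if (FD_ISSET(fds[i], &readfds))
                handle_ready(i);
    }
}

With 1000 descriptors this is close to the usual FD_SETSIZE limit, so
poll() or a latch-aware wait would be the more realistic choice, but the
shape is the same: one wait, then act on whichever children are ready.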

> Callbacks are usable for uncommon, fire-once-per-event operations
> such as error handling. In this case, the operation can be the async
> execution of a node, and the event can be "just before ExecProcNode
> on the topmost node". The first patch attached allows async-capable
> nodes to register callbacks during the Init phase and executes them
> just before the Exec phase on the topmost node. As a result, it
> greatly reduces the additional code. This is my first impression of
> what the word "callbacks" suggests.

This strikes me as pretty much uninteresting.  I bet if you test this
you'll find that it doesn't make anything faster.  You're kicking
things off asynchronously only a tiny fraction of a second before you
would have started them anyway.  What we need is a way to proceed
asynchronously through the entire execution of the query, not just at
the beginning.  I understand that your merge-join-double-gather
example can benefit from this, but it does so by kicking off both
subplans before we know for sure that we'll want to read from both
subplans.  I admit that optimization rarely kicks in, but it's a
potentially huge savings of effort when it does.

> Instead, in the second patch, I modified ExecProcNode to return an
> async status in the EState. It will be EXEC_READY or EXEC_EOT (end
> of table / no more tuples) for non-async-capable nodes, and
> async-capable nodes can set it to EXEC_NOT_READY, which indicates
> that there could be more tuples but none are available yet.
>
> Async-aware nodes such as Append can move on to the next child if
> the current child returned EXEC_NOT_READY. If all non-EXEC_EOT
> children returned EXEC_NOT_READY, Append will wait using some
> signaling mechanism (right now it busy-loops instead). As an
> example, the second patch modifies ExecAppend to handle this and
> modifies ExecSeqScan to return EXEC_NOT_READY with a certain
> probability, as an emulation of asynchronous tuple fetching. The
> UNION ALL query above then returns results interleaved among the
> three tables.
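
Roughly how I read that, as a sketch (the status names are from your
description; the comments are my guesses):

typedef enum ExecStatus
{
    EXEC_READY,        /* a tuple was returned in the slot */
    EXEC_EOT,          /* end of table: no more tuples will ever come */
    EXEC_NOT_READY     /* more tuples may come, but none is ready yet */
} ExecStatus;

Append would skip any child reporting EXEC_NOT_READY and wait only when
every not-yet-EXEC_EOT child reports EXEC_NOT_READY.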

I think the idea of distinguishing between "end of tuples" and "no
tuples currently ready" is a good one.  I am not particularly excited
about this particular signalling mechanism, though.  I am not sure why
you want to put this in the EState - isn't that shared between all
nodes in the plan tree, and doesn't that therefore create a risk of
confusion?  What I might imagine is giving ExecProcNode a second
argument that functions as an out parameter and is only meaningful
when TupIsNull(result).  It could, for example, be a boolean
indicating whether more tuples might become available later.  Maybe an
enum is better, but let's suppose a boolean for now.  At the top of
ExecProcNode we would do *async_pending = false.  Then, we'd pass
async_pending down to the ExecWhatever function that is potentially
async-capable, and it could set *async_pending = true if it wants.
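
As a rough sketch only (the two-argument ExecProcNode is hypothetical;
today's function takes just the PlanState, and the caller fragment below
is likewise made up):

/* Proposed shape: second argument acts as an out parameter. */
TupleTableSlot *ExecProcNode(PlanState *node, bool *async_pending);

/* At the top of ExecProcNode: */
*async_pending = false;

/* A caller such as ExecAppend could then do something like this: */
bool            async_pending;
TupleTableSlot *slot;

slot = ExecProcNode(subnode, &async_pending);
if (TupIsNull(slot))
{
    if (async_pending)
    {
        /* no tuple ready yet: try another child, or wait for one */
    }
    else
    {
        /* this child is exhausted for good */
    }
}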

>> Thanks for the attentive explanation. My concern about this is the
>> latency caused by synchronizing the producer and the consumer for
>> every single tuple. My previous patch is not asynchronous on every
>> tuple, so it can give a pure gain without any loss from tuple-wise
>> synchronization. But it looks clean and I like it, so I'll consider
>> this.
>>
>> > It seems pretty straightforward to fit Gather into this infrastructure.
>>
>> Yes.
>
> If Gather's children become a regular node struct, with a name like
> Worker(Node), instead of the non-Node structure they are now, we can
> generalize the tuple-synchronization mechanism so that it can be
> used by other nodes such as ForeignScan. Append(ForeignScan,
> ForeignScan, ...) with async tuple passing can spread the load
> across multiple foreign servers, so I suppose it is preferable if
> there is no penalty.

I don't quite understand what you mean by saying that Gather's
children are not a "regular node struct".  Which struct are you
talking about?

I do agree that transferring tuples one by one seems like it might be
inefficient.  Gather suffered from a problem of this nature which was
repaired, at least partially, by commit
bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167.  I'm wondering if we need
some kind of tuple-buffering abstraction.  A tuple-buffer could have
an on-receipt-of-tuples callback; if it doesn't, then the tuples can
be read out synchronously one by one.
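
Something along these lines, perhaps (entirely hypothetical; none of
these names exist in the tree):

typedef struct TupleBuffer TupleBuffer;

/* Invoked when new tuples arrive, if the consumer registered a callback. */
typedef void (*TupleBufferCallback) (TupleBuffer *buf, void *context);

struct TupleBuffer
{
    TupleTableSlot    **slots;           /* ring of buffered tuple slots */
    int                 nslots;
    int                 head;            /* next slot to read */
    int                 tail;            /* next slot to write */
    TupleBufferCallback on_receipt;      /* may be NULL */
    void               *callback_context;
};

/* Producer side: append a tuple and fire on_receipt if it is set. */
extern void TupleBufferPut(TupleBuffer *buf, TupleTableSlot *slot);

/* Consumer side: with no callback, read buffered tuples out one by one. */
extern TupleTableSlot *TupleBufferGet(TupleBuffer *buf);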

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


