Thread: Re: asynchronous execution

Re: asynchronous execution

From
Robert Haas
Date:
[ Adjusting subject line to reflect the actual topic of discussion better. ]

On Fri, Sep 23, 2016 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> For e.g., in the above plan which you specified, suppose :
>> 1. Hash Join has called ExecProcNode() for the child foreign scan b, and so
>> is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
>> 2. The event wait list already has foreign scan on a that is on a different
>> subtree.
>> 3. This foreign scan a happens to be ready, so in
>> ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is called,
>> which returns with result_ready.
>> 4. Since it returns result_ready, its parent node is now inserted in the
>> callbacks array, and so its parent (Append) is executed.
>> 5. But, this Append planstate is already in the middle of executing Hash
>> join, and is waiting for HashJoin.
>
> Ah, yeah, something like that could happen.  I've spent much of this
> week working on a new design for this feature which I think will avoid
> this problem.  It doesn't work yet - in fact I can't even really test
> it yet.  But I'll post what I've got by the end of the day today so
> that anyone who is interested can look at it and critique.

Well, I promised to post this, so here it is.  It's not really working
all that well at this point, and it's definitely not doing anything
that interesting, but you can see the outline of what I have in mind.
Since Kyotaro Horiguchi found that my previous design had a
system-wide performance impact due to the ExecProcNode changes, I
decided to take a different approach here: I created an async
infrastructure where both the requestor and the requestee have to be
specifically modified to support parallelism, and then modified Append
and ForeignScan to cooperate using the new interface.  Hopefully that
means that anything other than those two nodes will suffer no
performance impact.  Of course, it might have other problems....

Some notes:

- EvalPlanQual rechecks are broken.
- EXPLAIN ANALYZE instrumentation is broken.
- ExecReScanAppend is broken, because the async stuff needs some way
of canceling an async request and I didn't invent anything like that
yet.
- The postgres_fdw changes pretend to be async but aren't actually.
It's just a demo of (part of) the interface at this point.
- The postgres_fdw changes also report all pg-fdw paths as
async-capable, but actually the direct-modify ones aren't, so the
regression tests fail.
- Errors in the executor can leak the WaitEventSet.  Probably we need
to modify ResourceOwners to be able to own WaitEventSets.
- There are probably other bugs, too.

Whee!

Note that I've tried to solve the re-entrancy problems by (1) putting
all of the event loop's state inside the EState rather than in local
variables and (2) having the function that is called to report arrival
of a result be thoroughly different than the function that is used to
return a tuple to a synchronous caller.
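
For readers skimming the thread, here is a rough sketch of the shape that
interface takes. The names PendingAsyncRequest, ExecAsyncRequest and
ExecAsyncEventLoop come up later in this thread, but the fields and
signatures below are guesses for illustration, not the patch's actual
definitions.

/* Hedged sketch only; field layout and signatures are assumptions. */
typedef struct PendingAsyncRequest
{
    int         myindex;        /* position in the EState's pending array */
    PlanState  *requestor;      /* async-aware parent, e.g. AppendState */
    PlanState  *requestee;      /* async-capable child, e.g. ForeignScanState */
    int         state;          /* ASYNC_WAITING, ASYNC_COMPLETE, ... */
    TupleTableSlot *result;     /* filled in when the requestee completes */
} PendingAsyncRequest;

/* Requestor side: register interest in a tuple without blocking. */
extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
                             PlanState *requestee);

/* Event loop: waits on the WaitEventSet kept in the EState and delivers
 * results via response callbacks rather than through ExecProcNode, which
 * is what avoids re-entering a node that is already executing. */
extern void ExecAsyncEventLoop(EState *estate, PlanState *requestor,
                               long timeout);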

Comments welcome, if you're feeling brave enough to look at anything
this half-baked.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: asynchronous execution

From
Amit Khandekar
Date:
On 24 September 2016 at 06:39, Robert Haas <robertmhaas@gmail.com> wrote:
> Since Kyotaro Horiguchi found that my previous design had a
> system-wide performance impact due to the ExecProcNode changes, I
> decided to take a different approach here: I created an async
> infrastructure where both the requestor and the requestee have to be
> specifically modified to support parallelism, and then modified Append
> and ForeignScan to cooperate using the new interface.  Hopefully that
> means that anything other than those two nodes will suffer no
> performance impact.  Of course, it might have other problems....

I see that the reason you re-designed the asynchronous execution
implementation is that the earlier implementation showed
performance degradation in local sequential and local parallel scans.
But I checked that the ExecProcNode() changes were not significant
enough to cause the degradation. It will not call
ExecAsyncWaitForNode() unless that node supports asynchronism. Do you
feel there is anywhere else in the implementation that is really
causing this degradation? That previous implementation has some issues,
but they seemed solvable. We could resolve the plan state recursion
issue by explicitly making sure the same plan state does not get
called again while it is already executing.
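
As a concrete illustration of that idea, the guard could look roughly
like the sketch below; the ps_busy flag is an invented PlanState field,
not something in the posted patches.

/* Hypothetical guard against re-entering an already-executing node. */
static TupleTableSlot *
ExecDispatchNodeGuarded(PlanState *node)
{
    TupleTableSlot *slot;

    if (node->ps_busy)
        return NULL;            /* node is active further up the stack */

    node->ps_busy = true;
    slot = ExecProcNode(node);
    node->ps_busy = false;

    return slot;
}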

Thanks
-Amit Khandekar



Re: asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Sorry for the delayed response. I'll have enough time from now on, and
will address this.

At Fri, 23 Sep 2016 21:09:03 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoaXQEt4tZ03FtQhnzeDEMzBck+Lrni0UWHVVgOTnA6C1w@mail.gmail.com>
> Well, I promised to post this, so here it is.  It's not really working
> all that well at this point, and it's definitely not doing anything
> that interesting, but you can see the outline of what I have in mind.
> Since Kyotaro Horiguchi found that my previous design had a
> system-wide performance impact due to the ExecProcNode changes, I
> decided to take a different approach here: I created an async
> infrastructure where both the requestor and the requestee have to be
> specifically modified to support parallelism, and then modified Append
> and ForeignScan to cooperate using the new interface.  Hopefully that
> means that anything other than those two nodes will suffer no
> performance impact.  Of course, it might have other problems....
> 
> Some notes:
> 
> - EvalPlanQual rechecks are broken.
> - EXPLAIN ANALYZE instrumentation is broken.
> - ExecReScanAppend is broken, because the async stuff needs some way
> of canceling an async request and I didn't invent anything like that
> yet.
> - The postgres_fdw changes pretend to be async but aren't actually.
> It's just a demo of (part of) the interface at this point.
> - The postgres_fdw changes also report all pg-fdw paths as
> async-capable, but actually the direct-modify ones aren't, so the
> regression tests fail.
> - Errors in the executor can leak the WaitEventSet.  Probably we need
> to modify ResourceOwners to be able to own WaitEventSets.
> - There are probably other bugs, too.
> 
> Whee!
> 
> Note that I've tried to solve the re-entrancy problems by (1) putting
> all of the event loop's state inside the EState rather than in local
> variables and (2) having the function that is called to report arrival
> of a result be thoroughly different than the function that is used to
> return a tuple to a synchronous caller.
> 
> Comments welcome, if you're feeling brave enough to look at anything
> this half-baked.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hello, thank you for the comment.

At Wed, 28 Sep 2016 10:00:08 +0530, Amit Khandekar <amitdkhan.pg@gmail.com> wrote in
<CAJ3gD9fRmEhUoBMnNN8K_QrHZf7m4rmOHTFDj492oeLZff8o=w@mail.gmail.com>
> On 24 September 2016 at 06:39, Robert Haas <robertmhaas@gmail.com> wrote:
> > Since Kyotaro Horiguchi found that my previous design had a
> > system-wide performance impact due to the ExecProcNode changes, I
> > decided to take a different approach here: I created an async
> > infrastructure where both the requestor and the requestee have to be
> > specifically modified to support parallelism, and then modified Append
> > and ForeignScan to cooperate using the new interface.  Hopefully that
> > means that anything other than those two nodes will suffer no
> > performance impact.  Of course, it might have other problems....
> 
> I see that the reason you re-designed the asynchronous execution
> implementation is that the earlier implementation showed
> performance degradation in local sequential and local parallel scans.
> But I checked that the ExecProcNode() changes were not significant
> enough to cause the degradation.

The basic thought is that we can't accept a degradation of even as
little as around one percent for simple cases in exchange for this
feature (or similar ones).

A very simple SeqScan runs through a very short code path, where the
CPU penalty of branch-prediction failures from even a few added
branches has a visible impact. I avoided that by using likely/unlikely,
but a more fundamental measure is preferable.
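
For illustration, the annotation I mean looks like the following; the
condition and field names are assumptions, not the actual patch hunk.

/* Illustrative only: the async test and field names are assumed. */
TupleTableSlot *
ExecProcNode(PlanState *node)
{
    /*
     * This test sits on the hottest path in the executor, so the branch
     * is annotated with unlikely() to keep the common synchronous
     * dispatch on the predicted path.
     */
    if (unlikely(node->async_pending))
        return ExecAsyncWaitForNode(node);

    return ExecDispatchNode(node);  /* ordinary per-node-type dispatch */
}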

> It will not call ExecAsyncWaitForNode() unless that node
> supports asynchronism.

That's true, but it takes a certain number of CPU cycles to decide
whether to call it or not. That small amount of time is the issue in
focus now.

> Do you feel there is anywhere else in
> the implementation that is really causing this degradation? That
> previous implementation has some issues, but they seemed
> solvable. We could resolve the plan state recursion issue by
> explicitly making sure the same plan state does not get called
> again while it is already executing.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Thank you for the thought.


At Fri, 23 Sep 2016 21:09:03 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoaXQEt4tZ03FtQhnzeDEMzBck+Lrni0UWHVVgOTnA6C1w@mail.gmail.com>
> [ Adjusting subject line to reflect the actual topic of discussion better. ]
> 
> On Fri, Sep 23, 2016 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >> For e.g., in the above plan which you specified, suppose :
> >> 1. Hash Join has called ExecProcNode() for the child foreign scan b, and so
> >> is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
> >> 2. The event wait list already has foreign scan on a that is on a different
> >> subtree.
> >> 3. This foreign scan a happens to be ready, so in
> >> ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is called,
> >> which returns with result_ready.
> >> 4. Since it returns result_ready, its parent node is now inserted in the
> >> callbacks array, and so its parent (Append) is executed.
> >> 5. But, this Append planstate is already in the middle of executing Hash
> >> join, and is waiting for HashJoin.
> >
> > Ah, yeah, something like that could happen.  I've spent much of this
> > week working on a new design for this feature which I think will avoid
> > this problem.  It doesn't work yet - in fact I can't even really test
> > it yet.  But I'll post what I've got by the end of the day today so
> > that anyone who is interested can look at it and critique.
> 
> Well, I promised to post this, so here it is.  It's not really working
> all that well at this point, and it's definitely not doing anything
> that interesting, but you can see the outline of what I have in mind.
> Since Kyotaro Horiguchi found that my previous design had a
> system-wide performance impact due to the ExecProcNode changes, I
> decided to take a different approach here: I created an async
> infrastructure where both the requestor and the requestee have to be
> specifically modified to support parallelism, and then modified Append
> and ForeignScan to cooperate using the new interface.  Hopefully that
> means that anything other than those two nodes will suffer no
> performance impact.  Of course, it might have other problems....

The previous framework didn't need to distinguish async-capable
from non-capable nodes from the parent node's point of view; that is
why the changes in ExecProcNode were required. Instead, this new one
removes the ExecProcNode changes by distinguishing the two kinds of
node in async-aware parents, that is, Append. This no longer involves
async-unaware nodes in the tuple bubbling-up mechanism, so the
re-entrancy problem doesn't seem to occur.

On the other hand, consider for example the following plan, regardless
of its practicality (there should be a better example..):

(Async-unaware node)
 - NestLoop
   - Append
     - n * ForeignScan
   - Append
     - n * ForeignScan

If the NestLoop and Append are async-aware, all of the ForeignScans
can run asynchronously with the previous framework. The topmost
NestLoop will be awakened when the firing of any of the ForeignScans
makes a tuple bubble up to the NestLoop. This comes from the
no-need-to-distinguish-aware-or-not nature provided by the
ExecProcNode changes.

With the new one, in order to do the same thing, ExecAppend would in
turn have to behave differently depending on whether its parent is
async-aware or not. Doing that seems bothersome, but I'm not confident
about it.

I will examine this further, especially regarding performance
degradation and obstacles to completing this.


> Some notes:
> 
> - EvalPlanQual rechecks are broken.
> - EXPLAIN ANALYZE instrumentation is broken.
> - ExecReScanAppend is broken, because the async stuff needs some way
> of canceling an async request and I didn't invent anything like that
> yet.
> - The postgres_fdw changes pretend to be async but aren't actually.
> It's just a demo of (part of) the interface at this point.
> - The postgres_fdw changes also report all pg-fdw paths as
> async-capable, but actually the direct-modify ones aren't, so the
> regression tests fail.
> - Errors in the executor can leak the WaitEventSet.  Probably we need
> to modify ResourceOwners to be able to own WaitEventSets.
> - There are probably other bugs, too.
> 
> Whee!
> 
> Note that I've tried to solve the re-entrancy problems by (1) putting
> all of the event loop's state inside the EState rather than in local
> variables and (2) having the function that is called to report arrival
> of a result be thoroughly different than the function that is used to
> return a tuple to a synchronous caller.
> 
> Comments welcome, if you're feeling brave enough to look at anything
> this half-baked.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: asynchronous execution

From
Robert Haas
Date:
On Wed, Sep 28, 2016 at 12:30 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 24 September 2016 at 06:39, Robert Haas <robertmhaas@gmail.com> wrote:
>> Since Kyotaro Horiguchi found that my previous design had a
>> system-wide performance impact due to the ExecProcNode changes, I
>> decided to take a different approach here: I created an async
>> infrastructure where both the requestor and the requestee have to be
>> specifically modified to support parallelism, and then modified Append
>> and ForeignScan to cooperate using the new interface.  Hopefully that
>> means that anything other than those two nodes will suffer no
>> performance impact.  Of course, it might have other problems....
>
> I see that the reason you re-designed the asynchronous execution
> implementation is that the earlier implementation showed
> performance degradation in local sequential and local parallel scans.
> But I checked that the ExecProcNode() changes were not significant
> enough to cause the degradation.

I think we need some testing to prove that one way or the other.  If
you can do some - say on a plan with multiple nested loop joins with
inner index-scans, which will call ExecProcNode() a lot - that would
be great.  I don't think we can just rely on "it doesn't seem like it
should be slower", though - ExecProcNode() is too important a function
for us to guess at what the performance will be.

The thing I'm really worried about with either implementation is what
happens when we start to add asynchronous capability to multiple
nodes.  For example, if you imagine a plan like this:

Append
-> Hash Join
  -> Foreign Scan
  -> Hash
    -> Seq Scan
-> Hash Join
  -> Foreign Scan
  -> Hash
    -> Seq Scan

In order for this to run asynchronously, you need not only Append and
Foreign Scan to be async-capable, but also Hash Join.  That's true in
either approach.  Things are slightly better with the original
approach, but the basic problem is there in both cases.  So it seems
we need an approach that will make adding async capability to a node
really cheap, which seems like it might be a problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: asynchronous execution

From
Amit Khandekar
Date:
On 4 October 2016 at 02:30, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Sep 28, 2016 at 12:30 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 24 September 2016 at 06:39, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Since Kyotaro Horiguchi found that my previous design had a
>>> system-wide performance impact due to the ExecProcNode changes, I
>>> decided to take a different approach here: I created an async
>>> infrastructure where both the requestor and the requestee have to be
>>> specifically modified to support parallelism, and then modified Append
>>> and ForeignScan to cooperate using the new interface.  Hopefully that
>>> means that anything other than those two nodes will suffer no
>>> performance impact.  Of course, it might have other problems....
>>
>> I see that the reason you re-designed the asynchronous execution
>> implementation is that the earlier implementation showed
>> performance degradation in local sequential and local parallel scans.
>> But I checked that the ExecProcNode() changes were not significant
>> enough to cause the degradation.
>
> I think we need some testing to prove that one way or the other.  If
> you can do some - say on a plan with multiple nested loop joins with
> inner index-scans, which will call ExecProcNode() a lot - that would
> be great.  I don't think we can just rely on "it doesn't seem like it
> should be slower"
Agreed. I will come up with some tests.

> , though - ExecProcNode() is too important a function
> for us to guess at what the performance will be.

Also, parent pointers are not required in the new design. Thinking of
parent pointers, it now seems the event won't get bubbled up the tree
with the new design. But still, I think it's possible to switch over
to another asynchronous tree when some node in the current subtree
is waiting. But I am not sure; I will think more on that.

>
> The thing I'm really worried about with either implementation is what
> happens when we start to add asynchronous capability to multiple
> nodes.  For example, if you imagine a plan like this:
>
> Append
> -> Hash Join
>   -> Foreign Scan
>   -> Hash
>     -> Seq Scan
> -> Hash Join
>   -> Foreign Scan
>   -> Hash
>     -> Seq Scan
>
> In order for this to run asynchronously, you need not only Append and
> Foreign Scan to be async-capable, but also Hash Join.  That's true in
> either approach.  Things are slightly better with the original
> approach, but the basic problem is there in both cases.  So it seems
> we need an approach that will make adding async capability to a node
> really cheap, which seems like it might be a problem.

Yes, we might have to deal with this.

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: asynchronous execution

From
Robert Haas
Date:
On Tue, Oct 4, 2016 at 7:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Also, parent pointers are not required in the new design. Thinking of
> parent pointers, now it seems the event won't get bubbled up the tree
> with the new design. But still, , I think it's possible to switch over
> to the other asynchronous tree when some node in the current subtree
> is waiting. But I am not sure, will think more on that.

The bubbling-up still happens, because each node that made an async
request gets a callback with the final response - and if it is itself
the recipient of an async request, it can use that callback to respond
to that request in turn.  This version doesn't bubble up through
non-async-aware nodes, but that might be a good thing.
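
To make that concrete, a hedged sketch of the dispatch is below; in the
posted patches only Append is async-aware, and ExecAsyncAppendResponse is
an illustrative name rather than a confirmed one.

/* Deliver a completed request to its requestor.  If the requestor is
 * itself the requestee of another pending request, its handler can
 * complete that request in turn, so results bubble up through
 * async-aware ancestors only. */
static void
ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
{
    switch (nodeTag(areq->requestor))
    {
        case T_AppendState:
            ExecAsyncAppendResponse(estate, areq);
            break;
        default:
            elog(ERROR, "unrecognized node type: %d",
                 (int) nodeTag(areq->requestor));
    }
}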

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hello, this works, but ExecAppend shows a bit of degradation.

At Mon, 03 Oct 2016 19:46:32 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161003.194632.204401048.horiguchi.kyotaro@lab.ntt.co.jp>
> > Some notes:
> > 
> > - EvalPlanQual rechecks are broken.

This is fixed by adding (restoring) async-cancelation.

> > - EXPLAIN ANALYZE instrumentation is broken.

EXPLAIN ANALYZE seems to be working, but async-specific information is
not available yet.

> > - ExecReScanAppend is broken, because the async stuff needs some way
> > of canceling an async request and I didn't invent anything like that
> > yet.

Fixed in the same way as EvalPlanQual.

> > - The postgres_fdw changes pretend to be async but aren't actually.
> > It's just a demo of (part of) the interface at this point.

Applied my previous patch with some modification.

> > - The postgres_fdw changes also report all pg-fdw paths as
> > async-capable, but actually the direct-modify ones aren't, so the
> > regression tests fail.

All actions other than scans now do vacate_connection() before using a
connection.

> > - Errors in the executor can leak the WaitEventSet.  Probably we need
> > to modify ResourceOwners to be able to own WaitEventSets.

The WaitEventSet itself is not leaked, but the epoll fd should be
closed on failure. This seems doable by TRY-CATCHing in
ExecAsyncEventLoop. (Not done yet; a sketch follows.)
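
The sketch I have in mind, assuming the set is reachable from the EState
(the es_wait_event_set field is an invented name):

/* Hedged sketch: close the epoll fd if an error escapes the wait. */
PG_TRY();
{
    ExecAsyncEventWait(estate, timeout);
}
PG_CATCH();
{
    if (estate->es_wait_event_set != NULL)
    {
        FreeWaitEventSet(estate->es_wait_event_set); /* closes epoll fd */
        estate->es_wait_event_set = NULL;
    }
    PG_RE_THROW();
}
PG_END_TRY();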

> > - There are probably other bugs, too.
> > 
> > Whee!
> > 
> > Note that I've tried to solve the re-entrancy problems by (1) putting
> > all of the event loop's state inside the EState rather than in local
> > variables and (2) having the function that is called to report arrival
> > of a result be thoroughly different than the function that is used to
> > return a tuple to a synchronous caller.
> > 
> > Comments welcome, if you're feeling brave enough to look at anything
> > this half-baked.

This doesn't cause re-entry, since it no longer bubbles up
tuples through async-unaware nodes. This framework passes tuples
through private channels between requestors and requestees.

Anyway, I amended this, made postgres_fdw async, and then
finally all regression tests passed with minor modifications. The
attached patches are the following.

0001-robert-s-2nd-framework.patch
  The patch Robert showed upthread.

0002-Fix-some-bugs.patch
  A small patch to fix compilation errors in 0001.

0003-Modify-async-execution-infrastructure.patch
  Several modifications on the infrastructure. The details are
  shown after the measurement below.

0004-Make-postgres_fdw-async-capable.patch
  True-async postgres_fdw.

gentblr.sql, testrun.sh, calc.pl
  Performance test script suite.
  gentblr.sql - creates test tables.
  testrun.sh - does a single test run.
  calc.pl - drives testrun.sh and summarizes its results.

I measured performance and had the following result.

t0  - SELECT sum(a) FROM <local single table>;
pl  - SELECT sum(a) FROM <4 local children>;
pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;

The result is written as "time<ms> (std dev <ms>)"

sync
  t0: 3820.33 (  1.88)
  pl: 1608.59 ( 12.06)
 pf0: 7928.29 ( 46.58)
 pf1: 8023.16 ( 26.43)

async
  t0: 3806.31 (  4.49)    0.4% faster (should be error)
  pl: 1629.17 (  0.29)    1.3% slower
 pf0: 6447.07 ( 25.19)   18.7% faster
 pf1: 1876.80 ( 47.13)   76.6% faster

t0 is not affected since the ExecProcNode stuff has gone.

pl gets a bit slower (almost the same as the simple seqscan case with
the previous patch). This should be a branch misprediction penalty.

pf0, pf1 are faster as expected.


========

The below is a summary of modifications made by the 0002 and 0003 patches.

execAsync.c, execnodes.h:

 - Added include "pgstat.h" to use WAIT_EVENT_ASYNC_WAIT.
 - Changed the interface of ExecAsyncRequest to return whether a tuple
   is immediately available or not.
 - Made ExecAsyncConfigureWait return whether it registered at least
   one wait event or not. This is used so that the caller
   (ExecAsyncEventWait) knows it has an event to wait on (for safety).

   If two or more postgres_fdw nodes are sharing one connection,
   only one of them can be waited on at once. It is the responsibility
   of the FDW drivers to ensure that at least one wait event is added;
   on failure, WaitEventSetWait silently waits forever.

 - There were separate areq->callback_pending and
   areq->request_complete flags, but they always change together, so
   they are replaced with one state variable, areq->state. A new enum
   AsyncRequestState for areq->state is added in execnodes.h.
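
   The enum would look roughly like this; ASYNC_IDLE is a guessed
   member name, while the other three values appear later in this
   thread.

   /* Sketch of the replacement for the two flags described above. */
   typedef enum AsyncRequestState
   {
       ASYNC_IDLE,              /* no request outstanding */
       ASYNC_WAITING,           /* request issued, result not yet ready */
       ASYNC_CALLBACK_PENDING,  /* result ready, callback not yet run */
       ASYNC_COMPLETE           /* result delivered to the requestor */
   } AsyncRequestState;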
 


nodeAppend.c:

 - Return a tuple immediately if ExecAsyncRequest says that a tuple is
   available.
 - Reduced the nesting level of for(;;).

nodeForeignscan.[ch], fdwapi.h, execProcnode.c:

 - Calling postgresIterateForeignScan directly can yield tuples in the
   wrong shape. Call ExecForeignScan instead.
 - Changed the interface of AsyncConfigureWait as in execAsync.c.
 - Added the ShutdownForeignScan interface.

createplan.c, ruleutils.c, plannodes.h:

 - With Robert's change, EXPLAIN shows somewhat odd plans where the
   Output of Append is named after a non-parent child. This does no
   harm but is disconcerting. Added Append.referent, an index pointing
   to the parent, to make it reasonable. (But this looks ugly..) Still,
   children in EXPLAIN appear in a different order from the definition.
   (expected/postgres_fdw.out is edited.)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: asynchronous execution

From
Kyotaro HORIGUCHI
Date:
This is the version rebased on the current master (-0004), with the
resowner stuff (0005) and unlikely() (0006) added.

At Tue, 18 Oct 2016 10:30:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161018.103051.30820907.horiguchi.kyotaro@lab.ntt.co.jp>
> > > - Errors in the executor can leak the WaitEventSet.  Probably we need
> > > to modify ResourceOwners to be able to own WaitEventSets.
> 
> The WaitEventSet itself is not leaked, but the epoll fd should be
> closed on failure. This seems doable by TRY-CATCHing in
> ExecAsyncEventLoop. (Not done yet.)

Haha, that was a silly thing for me to say. The wait event set can
continue to live after a timeout, and any error can happen on the way
after that. I added an entry for wait event sets to the resource owner
mechanism and hung the ones created in ExecAsyncEventWait off
TopTransactionResourceOwner. Currently WaitLatchOrSocket doesn't do
so, in order not to change the current behavior. WaitEventSet doesn't
have a usable identifier for resowner.c, so I currently use its
address (pointer value) for the purpose. The 0005 patch does that.
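
The hooks follow the existing Remember/Forget pairs in resowner.c. A
sketch, with the wesarr array name assumed (and the usual
ResourceOwnerEnlarge step elided):

/* Hedged sketch of resource-owner tracking for WaitEventSets, keyed by
 * the set's pointer value as described above. */
void
ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
{
    ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
}

void
ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
{
    if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
        elog(ERROR, "wait event set %p is not owned by resource owner %s",
             events, owner->name);
}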

> I measured performance and had the following result.
> 
> t0  - SELECT sum(a) FROM <local single table>;
> pl  - SELECT sum(a) FROM <4 local children>;
> pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
> pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
> 
> The result is written as "time<ms> (std dev <ms>)"
> 
> sync
>   t0: 3820.33 (  1.88)
>   pl: 1608.59 ( 12.06)
>  pf0: 7928.29 ( 46.58)
>  pf1: 8023.16 ( 26.43)
> 
> async
>   t0: 3806.31 (  4.49)    0.4% faster (should be error)
>   pl: 1629.17 (  0.29)    1.3% slower
>  pf0: 6447.07 ( 25.19)   18.7% faster
>  pf1: 1876.80 ( 47.13)   76.6% faster
> 
> t0 is not affected since the ExecProcNode stuff has gone.
> 
> pl gets a bit slower (almost the same as the simple seqscan case with
> the previous patch). This should be a branch misprediction penalty.

Using the likely() macro in ExecAppend seems to have shaken off the
degradation.

sync
  t0: 3919.49 (  5.95)
  pl: 1637.95 (  0.75)
 pf0: 8304.20 ( 43.94)
 pf1: 8222.09 ( 28.20)

async
  t0: 3885.84 ( 40.20)  0.86% faster (should be error but stable on my env..)
  pl: 1617.20 (  3.51)  1.26% faster (ditto)
 pf0: 6680.95 (478.72)  19.5% faster
 pf1: 1886.87 ( 36.25)  77.1% faster

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hi, this is the 7th patch, to make instrumentation work.

EXPLAIN ANALYZE shows the following result with the previous patch set.

| Aggregate  (cost=820.25..820.26 rows=1 width=8)
|            (actual time=4324.676..4324.676 rows=1 loops=1)
|   ->  Append  (cost=0.00..791.00 rows=11701 width=4)
|               (actual time=0.910..3663.882 rows=4000000 loops=1)
|     ->  Foreign Scan on ft10  (cost=100.00..197.75 rows=2925 width=4)
|                               (never executed)
|     ->  Foreign Scan on ft20  (cost=100.00..197.75 rows=2925 width=4)
|                               (never executed)
|     ->  Foreign Scan on ft30  (cost=100.00..197.75 rows=2925 width=4)
|                               (never executed)
|     ->  Foreign Scan on ft40  (cost=100.00..197.75 rows=2925 width=4)
|                               (never executed)
|     ->  Seq Scan on pf0  (cost=0.00..0.00 rows=1 width=4)
|                          (actual time=0.004..0.004 rows=0 loops=1)

The current instrumentation code assumes that a request to a node
always either returns a tuple or reaches the end of tuples. This async
framework has two points of executing underlying nodes:
ExecAsyncRequest and ExecAsyncEventLoop. So I'm not sure if this is
appropriate, but anyway it seems to show sane numbers, as in the plan
below (a sketch of the bracketing follows the plan).

| Aggregate  (cost=820.25..820.26 rows=1 width=8)
|            (actual time=4571.205..4571.206 rows=1 loops=1)
|   ->  Append  (cost=0.00..791.00 rows=11701 width=4)
|               (actual time=1.362..3893.114 rows=4000000 loops=1)
|     ->  Foreign Scan on ft10  (cost=100.00..197.75 rows=2925 width=4)
|                        (actual time=1.056..770.863 rows=1000000 loops=1)
|     ->  Foreign Scan on ft20  (cost=100.00..197.75 rows=2925 width=4)
|                        (actual time=0.461..767.840 rows=1000000 loops=1)
|     ->  Foreign Scan on ft30  (cost=100.00..197.75 rows=2925 width=4)
|                        (actual time=0.474..782.547 rows=1000000 loops=1)
|     ->  Foreign Scan on ft40  (cost=100.00..197.75 rows=2925 width=4)
|                        (actual time=0.156..765.920 rows=1000000 loops=1)
|     ->  Seq Scan on pf0  (cost=0.00..0.00 rows=1 width=4) (never executed)
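
What "special treatment" means concretely: both execution points have to
bracket the work with the instrumentation calls that ExecProcNode would
normally make. Below is a hedged sketch of one of the two points
(InstrStartNode/InstrStopNode are the real instrument.c entry points, but
exactly where the patch places them is an assumption).

/* Illustrative bracketing of the ExecAsyncRequest execution point. */
if (requestee->instrument)
    InstrStartNode(requestee->instrument);

ExecAsyncRequest(estate, requestor, requestee);  /* may return a tuple */

if (requestee->instrument)
    InstrStopNode(requestee->instrument,
                  TupIsNull(areq->result) ? 0.0 : 1.0);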


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hello,

I'm not sure this is in a suitable shape for the commitfest, but I
decided to register it to ride on the bus for 10.0.

> Hi, this is the 7th patch to make instrumentation work.

This is a PoC patch of the asynchronous execution feature, based on an
executor infrastructure Robert proposed. These patches are
rebased on the current master.

0001-robert-s-2nd-framework.patch

 Robert's executor async infrastructure. Async-driver nodes
 register their async-capable children, and sync and data transfer
 are done out of band of the ordinary ExecProcNode channel. So async
 execution no longer disturbs async-unaware nodes or slows them
 down.

0002-Fix-some-bugs.patch

 Some fixes for 0001 to work. This is just to preserve the shape
 of the 0001 patch.

0003-Modify-async-execution-infrastructure.patch

 The original infrastructure doesn't work when multiple foreign
 tables are on the same connection. This makes it work.

0004-Make-postgres_fdw-async-capable.patch

 Makes postgres_fdw work asynchronously.

0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch

 This addresses a problem pointed out by Robert about the 0001 patch,
 that the WaitEventSet used for async execution can be leaked by
 errors.

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch

 ExecAppend gets a bit slower due to branch misprediction
 penalties. This fixes it by using the unlikely() macro.

0007-Add-instrumentation-to-async-execution.patch

 As described above for 0001, the async infrastructure conveys
 tuples outside the ExecProcNode channel, so EXPLAIN ANALYZE requires
 special treatment to show sane results. This patch tries that.


A result of a performance measurement is in this message.

https://www.postgresql.org/message-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp


| t0  - SELECT sum(a) FROM <local single table>;
| pl  - SELECT sum(a) FROM <4 local children>;
| pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
| pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
...
| async
|   t0: 3885.84 ( 40.20)  0.86% faster (should be error but stable on my env..)
|   pl: 1617.20 (  3.51)  1.26% faster (ditto)
|  pf0: 6680.95 (478.72)  19.5% faster
|  pf1: 1886.87 ( 36.25)  77.1% faster

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hello, this is a maintenance post of rebased patches.
I added a change to ResourceOwnerData that was missed in 0005.

At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp>
> This a PoC patch of asynchronous execution feature, based on a
> executor infrastructure Robert proposed. These patches are
> rebased on the current master.
> 
> 0001-robert-s-2nd-framework.patch
> 
>  Roberts executor async infrastructure. Async-driver nodes
>  register its async-capable children and sync and data transfer
>  are done out of band of ordinary ExecProcNode channel. So async
>  execution no longer disturbs async-unaware node and slows them
>  down.
> 
> 0002-Fix-some-bugs.patch
> 
>  Some fixes for 0001 to work. This is just to preserve the shape
>  of 0001 patch.
> 
> 0003-Modify-async-execution-infrastructure.patch
> 
>  The original infrastructure doesn't work when multiple foreign
>  tables is on the same connection. This makes it work.
> 
> 0004-Make-postgres_fdw-async-capable.patch
> 
>  Makes postgres_fdw to work asynchronously.
> 
> 0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch
> 
>  This addresses a problem pointed by Robers about 0001 patch,
>  that WaitEventSet used for async execution can leak by errors.
> 
> 0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch
> 
>  ExecAppend gets a bit slower by penalties of misprediction of
>  branches. This fixes it by using unlikely() macro.
> 
> 0007-Add-instrumentation-to-async-execution.patch
> 
>  As the description above for 0001, async infrastructure conveys
>  tuples outside ExecProcNode channel so EXPLAIN ANALYZE requires
>  special treat to show sane results. This patch tries that.
> 
> 
> A result of a performance measurement is in this message.
> 
> https://www.postgresql.org/message-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp
> 
> 
> | t0  - SELECT sum(a) FROM <local single table>;
> | pl  - SELECT sum(a) FROM <4 local children>;
> | pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
> | pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
> ...
> | async
> |   t0: 3885.84 ( 40.20)  0.86% faster (should be error but stable on my env..)
> |   pl: 1617.20 (  3.51)  1.26% faster (ditto)
> |  pf0: 6680.95 (478.72)  19.5% faster
> |  pf1: 1886.87 ( 36.25)  77.1% faster

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hello,

I cannot respond until next Monday, so I am moving this to the next CF
myself.

At Tue, 15 Nov 2016 20:25:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161115.202513.268072050.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello, this is a maintenance post of rebased patches.
> I added a change to ResourceOwnerData that was missed in 0005.
> 
> At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp>
> > This a PoC patch of asynchronous execution feature, based on a
> > executor infrastructure Robert proposed. These patches are
> > rebased on the current master.
> > 
> > 0001-robert-s-2nd-framework.patch
> > 
> >  Roberts executor async infrastructure. Async-driver nodes
> >  register its async-capable children and sync and data transfer
> >  are done out of band of ordinary ExecProcNode channel. So async
> >  execution no longer disturbs async-unaware node and slows them
> >  down.
> > 
> > 0002-Fix-some-bugs.patch
> > 
> >  Some fixes for 0001 to work. This is just to preserve the shape
> >  of 0001 patch.
> > 
> > 0003-Modify-async-execution-infrastructure.patch
> > 
> >  The original infrastructure doesn't work when multiple foreign
> >  tables is on the same connection. This makes it work.
> > 
> > 0004-Make-postgres_fdw-async-capable.patch
> > 
> >  Makes postgres_fdw to work asynchronously.
> > 
> > 0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch
> > 
> >  This addresses a problem pointed by Robers about 0001 patch,
> >  that WaitEventSet used for async execution can leak by errors.
> > 
> > 0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch
> > 
> >  ExecAppend gets a bit slower by penalties of misprediction of
> >  branches. This fixes it by using unlikely() macro.
> > 
> > 0007-Add-instrumentation-to-async-execution.patch
> > 
> >  As the description above for 0001, async infrastructure conveys
> >  tuples outside ExecProcNode channel so EXPLAIN ANALYZE requires
> >  special treat to show sane results. This patch tries that.
> > 
> > 
> > A result of a performance measurement is in this message.
> > 
> > https://www.postgresql.org/message-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp
> > 
> > 
> > | t0  - SELECT sum(a) FROM <local single table>;
> > | pl  - SELECT sum(a) FROM <4 local children>;
> > | pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
> > | pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
> > ...
> > | async
> > |   t0: 3885.84 ( 40.20)  0.86% faster (should be error but stable on my env..)
> > |   pl: 1617.20 (  3.51)  1.26% faster (ditto)
> > |  pf0: 6680.95 (478.72)  19.5% faster
> > |  pf1: 1886.87 ( 36.25)  77.1% faster

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
This patch conflicts with e13029a (es_query_dsa) so I rebased
this.

At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp>
> This a PoC patch of asynchronous execution feature, based on a
> executor infrastructure Robert proposed. These patches are
> rebased on the current master.
> 
> 0001-robert-s-2nd-framework.patch
> 
>  Roberts executor async infrastructure. Async-driver nodes
>  register its async-capable children and sync and data transfer
>  are done out of band of ordinary ExecProcNode channel. So async
>  execution no longer disturbs async-unaware node and slows them
>  down.
> 
> 0002-Fix-some-bugs.patch
> 
>  Some fixes for 0001 to work. This is just to preserve the shape
>  of 0001 patch.
> 
> 0003-Modify-async-execution-infrastructure.patch
> 
>  The original infrastructure doesn't work when multiple foreign
>  tables is on the same connection. This makes it work.
> 
> 0004-Make-postgres_fdw-async-capable.patch
> 
>  Makes postgres_fdw to work asynchronously.
> 
> 0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch
> 
>  This addresses a problem pointed by Robers about 0001 patch,
>  that WaitEventSet used for async execution can leak by errors.
> 
> 0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch
> 
>  ExecAppend gets a bit slower by penalties of misprediction of
>  branches. This fixes it by using unlikely() macro.
> 
> 0007-Add-instrumentation-to-async-execution.patch
> 
>  As the description above for 0001, async infrastructure conveys
>  tuples outside ExecProcNode channel so EXPLAIN ANALYZE requires
>  special treat to show sane results. This patch tries that.
> 
> 
> A result of a performance measurement is in this message.
> 
> https://www.postgresql.org/message-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp
> 
> 
> | t0  - SELECT sum(a) FROM <local single table>;
> | pl  - SELECT sum(a) FROM <4 local children>;
> | pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
> | pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
> ...
> | async
> |   t0: 3885.84 ( 40.20)  0.86% faster (should be error but stable on my env..)
> |   pl: 1617.20 (  3.51)  1.26% faster (ditto)
> |  pf0: 6680.95 (478.72)  19.5% faster
> |  pf1: 1886.87 ( 36.25)  77.1% faster

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
I noticed that this patch is conflicting with 665d1fa (Logical
replication) so I rebased this. Only executor/Makefile
conflicted.

At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp>
> This a PoC patch of asynchronous execution feature, based on a
> executor infrastructure Robert proposed. These patches are
> rebased on the current master.
> 
> 0001-robert-s-2nd-framework.patch
> 
>  Roberts executor async infrastructure. Async-driver nodes
>  register its async-capable children and sync and data transfer
>  are done out of band of ordinary ExecProcNode channel. So async
>  execution no longer disturbs async-unaware node and slows them
>  down.
> 
> 0002-Fix-some-bugs.patch
> 
>  Some fixes for 0001 to work. This is just to preserve the shape
>  of 0001 patch.
> 
> 0003-Modify-async-execution-infrastructure.patch
> 
>  The original infrastructure doesn't work when multiple foreign
>  tables is on the same connection. This makes it work.
> 
> 0004-Make-postgres_fdw-async-capable.patch
> 
>  Makes postgres_fdw to work asynchronously.
> 
> 0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch
> 
>  This addresses a problem pointed by Robers about 0001 patch,
>  that WaitEventSet used for async execution can leak by errors.
> 
> 0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch
> 
>  ExecAppend gets a bit slower by penalties of misprediction of
>  branches. This fixes it by using unlikely() macro.
> 
> 0007-Add-instrumentation-to-async-execution.patch
> 
>  As the description above for 0001, async infrastructure conveys
>  tuples outside ExecProcNode channel so EXPLAIN ANALYZE requires
>  special treat to show sane results. This patch tries that.
> 
> 
> A result of a performance measurement is in this message.
> 
> https://www.postgresql.org/message-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp
> 
> 
> | t0  - SELECT sum(a) FROM <local single table>;
> | pl  - SELECT sum(a) FROM <4 local children>;
> | pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
> | pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
> ...
> | async
> |   t0: 3885.84 ( 40.20)  0.86% faster (should be error but stable on my env..)
> |   pl: 1617.20 (  3.51)  1.26% faster (ditto)
> |  pf0: 6680.95 (478.72)  19.5% faster
> |  pf1: 1886.87 ( 36.25)  77.1% faster


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] asynchronous execution

From
Michael Paquier
Date:
On Tue, Jan 31, 2017 at 12:45 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I noticed that this patch is conflicting with 665d1fa (Logical
> replication) so I rebased this. Only executor/Makefile
> conflicted.

The patches still apply, moved to CF 2017-03. Be aware of that:
$ git diff HEAD~6 --check
contrib/postgres_fdw/postgres_fdw.c:388: indent with spaces.
+                           PendingAsyncRequest *areq,
contrib/postgres_fdw/postgres_fdw.c:389: indent with spaces.
+                           bool reinit);
src/backend/utils/resowner/resowner.c:1332: new blank line at EOF.
-- 
Michael



Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Thank you.

At Wed, 1 Feb 2017 14:11:58 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqS0MhZrzgMVQeFEnnKABcsMnNULd8=O0PG7_h-FUp5aEQ@mail.gmail.com>
> On Tue, Jan 31, 2017 at 12:45 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > I noticed that this patch is conflicting with 665d1fa (Logical
> > replication) so I rebased this. Only executor/Makefile
> > conflicted.
> 
> The patches still apply, moved to CF 2017-03. Be aware of that:
> $ git diff HEAD~6 --check
> contrib/postgres_fdw/postgres_fdw.c:388: indent with spaces.
> +                           PendingAsyncRequest *areq,
> contrib/postgres_fdw/postgres_fdw.c:389: indent with spaces.
> +                           bool reinit);
> src/backend/utils/resowner/resowner.c:1332: new blank line at EOF.

Thank you for letting me know about the command. I changed my check
scripts to use it, and it seems to work fine on both commit and
rebase.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

From
Antonin Houska
Date:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> I noticed that this patch is conflicting with 665d1fa (Logical
> replication) so I rebased this. Only executor/Makefile
> conflicted.

I was lucky enough to see an infinite loop when using this patch, which I
fixed by this change:

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 588ba18..9b87fbd
*** a/src/backend/executor/execAsync.c
--- b/src/backend/executor/execAsync.c
*************** ExecAsyncEventWait(EState *estate, long
*** 364,369 ****
--- 364,370 ----
          if ((w->events & WL_LATCH_SET) != 0)
          {
+             ResetLatch(MyLatch);
              process_latch_set = true;
              continue;
          }

Actually, it is only _almost_ fixed, because at some point one of the
following

Assert(areq->state == ASYNC_WAITING);

statements fired. I think it was the immediately following one, but I can
imagine the same happening in the branch

if (process_latch_set)
...

I think the wants_process_latch field of PendingAsyncRequest is not useful
on its own, because the process latch can be set for reasons completely
unrelated to the asynchronous processing. If the asynchronous node should
use the latch to signal its readiness, I think an additional flag is needed
in the request which tells ExecAsyncEventWait that the latch was set by the
asynchronous node.

BTW, do we really need the ASYNC_CALLBACK_PENDING state? I can imagine the
async node either changing ASYNC_WAITING directly to ASYNC_COMPLETE, or
leaving it ASYNC_WAITING if the data is not ready.


In addition, the following comments are based only on code review; I didn't
verify my understanding experimentally:

* Isn't it possible for AppendState.as_asyncresult to contain multiple
  responses from the same async node? Since the array stores
  TupleTableSlot instead of the actual tuple (so multiple items of
  as_asyncresult point to the same slot), I suspect the slot contents
  might not be defined when the Append node eventually tries to return
  it to the upper plan.

* For the WaitEvent subsystem to work, I think postgres_fdw should keep
  a separate libpq connection per node, not per user mapping. Currently
  the connections are cached by user mapping, but it's legal to locate
  multiple child postgres_fdw nodes of an Append plan on the same remote
  server. I expect that these "co-located" nodes would currently use the
  same user mapping and therefore the same connection.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: [HACKERS] asynchronous execution

From
Corey Huinker
Date:

On Fri, Feb 3, 2017 at 5:04 AM, Antonin Houska <ah@cybertec.at> wrote:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> I noticed that this patch is conflicting with 665d1fa (Logical
> replication) so I rebased this. Only executor/Makefile
> conflicted.

I was lucky enough to see an infinite loop when using this patch, which I
fixed by this change:

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 588ba18..9b87fbd
*** a/src/backend/executor/execAsync.c
--- b/src/backend/executor/execAsync.c
*************** ExecAsyncEventWait(EState *estate, long
*** 364,369 ****
--- 364,370 ----

                if ((w->events & WL_LATCH_SET) != 0)
                {
+                       ResetLatch(MyLatch);
                        process_latch_set = true;
                        continue;
                }


Hi, I've been testing this patch because it seemed like it would help a
use case of mine, but I can't tell if it's currently working for cases
other than a local parent table that has many child partitions which
happen to be foreign tables. Is it? I was hoping to use it for a case
like:

select x, sum(y) from one_remote_table
union all
select x, sum(y) from another_remote_table
union all
select x, sum(y) from a_third_remote_table

but while aggregates do appear to be pushed down, it seems that the remote tables are being queried in sequence. Am I doing something wrong?

Re: [HACKERS] asynchronous execution

From
Amit Langote
Date:
Horiguchi-san,

On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:
> I noticed that this patch is conflicting with 665d1fa (Logical
> replication) so I rebased this. Only executor/Makefile
> conflicted.

With the latest set of patches, I observe a crash due to an Assert failure:

#0  0x0000003969632625 in *__GI_raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003969633e05 in *__GI_abort () at abort.c:92
#2  0x000000000098b22c in ExceptionalCondition (conditionName=0xb30e02
"!(added)", errorType=0xb30d77 "FailedAssertion", fileName=0xb30d50
"execAsync.c",   lineNumber=345) at assert.c:54
#3  0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8,
timeout=-1) at execAsync.c:345
#4  0x0000000000687ed5 in ExecAsyncEventLoop (estate=0x13c01b8,
requestor=0x13c1640, timeout=-1) at execAsync.c:186
#5  0x00000000006a5170 in ExecAppend (node=0x13c1640) at nodeAppend.c:257
#6  0x0000000000692b9b in ExecProcNode (node=0x13c1640) at execProcnode.c:411
#7  0x00000000006bf4d7 in ExecResult (node=0x13c1170) at nodeResult.c:113
#8  0x0000000000692b5c in ExecProcNode (node=0x13c1170) at execProcnode.c:399
#9  0x00000000006a596b in fetch_input_tuple (aggstate=0x13c06a0) at
nodeAgg.c:587
#10 0x00000000006a8530 in agg_fill_hash_table (aggstate=0x13c06a0) at
nodeAgg.c:2272
#11 0x00000000006a7e76 in ExecAgg (node=0x13c06a0) at nodeAgg.c:1910
#12 0x0000000000692d69 in ExecProcNode (node=0x13c06a0) at execProcnode.c:514
#13 0x00000000006c1a42 in ExecSort (node=0x13c03d0) at nodeSort.c:103
#14 0x0000000000692d3f in ExecProcNode (node=0x13c03d0) at execProcnode.c:506
#15 0x000000000068e733 in ExecutePlan (estate=0x13c01b8,
planstate=0x13c03d0, use_parallel_mode=0 '\000', operation=CMD_SELECT,
sendTuples=1 '\001',   numberTuples=0, direction=ForwardScanDirection, dest=0x7fa368ee1da8)
at execMain.c:1609
#16 0x000000000068c751 in standard_ExecutorRun (queryDesc=0x135c568,
direction=ForwardScanDirection, count=0) at execMain.c:341
#17 0x000000000068c5dc in ExecutorRun (queryDesc=0x135c568,
<snip>

I was running a query whose plan looked like:

explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
group by 1,2 order by 1;
                      QUERY PLAN
------------------------------------------------------
 Sort
   Sort Key: ((ptab.tableoid)::regclass)
   ->  HashAggregate
         Group Key: (ptab.tableoid)::regclass, ptab.a
         ->  Result
               ->  Append
                     ->  Foreign Scan on ptab_00001
                     ->  Foreign Scan on ptab_00002
                     ->  Foreign Scan on ptab_00003
                     ->  Foreign Scan on ptab_00004
                     ->  Foreign Scan on ptab_00005
                     ->  Foreign Scan on ptab_00006
                     ->  Foreign Scan on ptab_00007
                     ->  Foreign Scan on ptab_00008
                     ->  Foreign Scan on ptab_00009
                     ->  Foreign Scan on ptab_00010
               <snip>

The snipped part contains Foreign Scans on 90 more foreign partitions (in
fact, I could see the crash even with 10 foreign table partitions for the
same query).

There is a crash in one more case, which seems related to how WaitEventSet
objects are manipulated during resource-owner-mediated cleanup of a failed
query, such as after the FDW returned an error like below:

ERROR:  relation "public.ptab_00010" does not exist
CONTEXT:  Remote SQL command: SELECT a, b FROM public.ptab_00010

The backtrace in this looks like below:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
301                    lastidx = resarr->lastidx;
(gdb)
(gdb) bt
#0  0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
#1  0x00000000009c6578 in ResourceOwnerForgetWES
(owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
#2  0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
#3  0x00000000009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001')   at resowner.c:566
#4  0x00000000009c5155 in ResourceOwnerRelease (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1
'\001') at resowner.c:485
#5  0x0000000000524172 in AbortTransaction () at xact.c:2588
#6  0x0000000000524854 in AbortCurrentTransaction () at xact.c:3016
#7  0x0000000000836aa6 in PostgresMain (argc=1, argv=0x12d7b08,
dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
#8  0x00000000007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
#9  0x00000000007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
#10 0x00000000007a0885 in ServerLoop () at postmaster.c:1722
#11 0x000000000079febf in PostmasterMain (argc=3, argv=0x12aacc0) at
postmaster.c:1330
#12 0x00000000006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228

There is a segfault when accessing the events variable, whose members seem
to be pfreed:

(gdb) f 2
#2  0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
600            ResourceOwnerForgetWES(set->resowner, set);
(gdb) p *set
$5 = {
  nevents = 2139062143,
  nevents_space = 2139062143,
  resowner = 0x7f7f7f7f7f7f7f7f,
  events = 0x7f7f7f7f7f7f7f7f,
  latch = 0x7f7f7f7f7f7f7f7f,
  latch_pos = 2139062143,
  epoll_fd = 2139062143,
  epoll_ret_events = 0x7f7f7f7f7f7f7f7f
}

Thanks,
Amit





Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Thank you very much for testing this!

At Tue, 7 Feb 2017 13:28:42 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<9058d70b-a6b0-8b3c-091a-fe77ed0df580@lab.ntt.co.jp>
> Horiguchi-san,
> 
> On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:
> > I noticed that this patch is conflicting with 665d1fa (Logical
> > replication) so I rebased this. Only executor/Makefile
> > conflicted.
> 
> With the latest set of patches, I observe a crash due to an Assert failure:
> 
> #3  0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8,
> timeout=-1) at execAsync.c:345

This means that no pending FDW scan let itself go to the waiting
stage, which makes the whole thing stick. This is caused when
nothing is actually waiting for a result. I suppose that all of the
foreign scans ran on the same connection. Anyway, it should be a
mistake in the state transitions. I'll look into it.

> I was running a query whose plan looked like:
> 
> explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
> group by 1,2 order by 1;
>                       QUERY PLAN
> ------------------------------------------------------
>  Sort
>    Sort Key: ((ptab.tableoid)::regclass)
>    ->  HashAggregate
>          Group Key: (ptab.tableoid)::regclass, ptab.a
>          ->  Result
>                ->  Append
>                      ->  Foreign Scan on ptab_00001
>                      ->  Foreign Scan on ptab_00002
>                      ->  Foreign Scan on ptab_00003
>                      ->  Foreign Scan on ptab_00004
>                      ->  Foreign Scan on ptab_00005
>                      ->  Foreign Scan on ptab_00006
>                      ->  Foreign Scan on ptab_00007
>                      ->  Foreign Scan on ptab_00008
>                      ->  Foreign Scan on ptab_00009
>                      ->  Foreign Scan on ptab_00010
>                <snip>
> 
> The snipped part contains Foreign Scans on 90 more foreign partitions (in
> fact, I could see the crash even with 10 foreign table partitions for the
> same query).

Yeah, it seems to me unrelated to how many there are.

> There is a crash in one more case, which seems related to how WaitEventSet
> objects are manipulated during resource-owner-mediated cleanup of a failed
> query, such as after the FDW returned an error like below:
> 
> ERROR:  relation "public.ptab_00010" does not exist
> CONTEXT:  Remote SQL command: SELECT a, b FROM public.ptab_00010
> 
> The backtrace in this looks like below:
> 
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
> value=20645152) at resowner.c:301
> 301                    lastidx = resarr->lastidx;
> (gdb)
> (gdb) bt
> #0  0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
> value=20645152) at resowner.c:301
> #1  0x00000000009c6578 in ResourceOwnerForgetWES
> (owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
> #2  0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
> #3  0x00000000009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768,
> phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001')
>     at resowner.c:566
> #4  0x00000000009c5155 in ResourceOwnerRelease (owner=0x12de768,
> phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1
> '\001') at resowner.c:485
> #5  0x0000000000524172 in AbortTransaction () at xact.c:2588
> #6  0x0000000000524854 in AbortCurrentTransaction () at xact.c:3016
> #7  0x0000000000836aa6 in PostgresMain (argc=1, argv=0x12d7b08,
> dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
> #8  0x00000000007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
> #9  0x00000000007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
> #10 0x00000000007a0885 in ServerLoop () at postmaster.c:1722
> #11 0x000000000079febf in PostmasterMain (argc=3, argv=0x12aacc0) at
> postmaster.c:1330
> #12 0x00000000006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228
> 
> There is a segfault when accessing the events variable, whose members appear
> to have been pfree'd (0x7f is the fill byte that assert-enabled builds write
> into freed memory):
> 
> (gdb) f 2
> #2  0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
> 600            ResourceOwnerForgetWES(set->resowner, set);
> (gdb) p *set
> $5 = {
>   nevents = 2139062143,
>   nevents_space = 2139062143,
>   resowner = 0x7f7f7f7f7f7f7f7f,
>   events = 0x7f7f7f7f7f7f7f7f,
>   latch = 0x7f7f7f7f7f7f7f7f,
>   latch_pos = 2139062143,
>   epoll_fd = 2139062143,
>   epoll_ret_events = 0x7f7f7f7f7f7f7f7f
> }

Mmm, I reproduced it quite easily. A silly bug.

Something bad is happening in the interaction between freeing the
ExecutorState memory context and the resource owner. Perhaps the
ExecutorState is freed by the resowner (as part of one of its
ancestors) before the memory for the WaitEventSet is freed. It was
careless of me. I'll reconsider it.

Great thanks for the report.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
At Thu, 16 Feb 2017 21:06:00 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170216.210600.214980879.horiguchi.kyotaro@lab.ntt.co.jp>
> > #3  0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8,
> > timeout=-1) at execAsync.c:345
> 
> This means that no pending FDW scan put itself into the waiting
> stage, which makes the whole thing get stuck. This happens when no
> one is actually waiting for a result. I suppose that all of the
> foreign scans ran on the same connection. In any case it must be a
> mistake in the state transitions. I'll look into it.
...
> > I was running a query whose plan looked like:
> > 
> > explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
> > group by 1,2 order by 1;
> >                       QUERY PLAN
> > ------------------------------------------------------
> >  Sort
> >    Sort Key: ((ptab.tableoid)::regclass)
> >    ->  HashAggregate
> >          Group Key: (ptab.tableoid)::regclass, ptab.a
> >          ->  Result
> >                ->  Append
> >                      ->  Foreign Scan on ptab_00001
> >                      ->  Foreign Scan on ptab_00002
> >                      ->  Foreign Scan on ptab_00003
> >                      ->  Foreign Scan on ptab_00004
> >                      ->  Foreign Scan on ptab_00005
> >                      ->  Foreign Scan on ptab_00006
> >                      ->  Foreign Scan on ptab_00007
> >                      ->  Foreign Scan on ptab_00008
> >                      ->  Foreign Scan on ptab_00009
> >                      ->  Foreign Scan on ptab_00010
> >                <snip>
> > 
> > The snipped part contains Foreign Scans on 90 more foreign partitions (in
> > fact, I could see the crash even with 10 foreign table partitions for the
> > same query).
> 
> Yeah, it seems to me to be unrelated to how many there are.

Finally, I couldn't reproduce the crash for the (maybe) same
case. I can guess at two possible reasons. One is a situation
where node->as_nasyncpending differs from
estate->es_num_pending_async, but I couldn't find a way for that
to happen. The other is a situation in postgresIterateForeignScan
where the "next owner" reaches EOF but another waiter has not. I
haven't reproduced that situation, but I fixed the code for the
case anyway. In addition, I found a bug in
ExecAsyncAppendResponse: it called bms_add_member in an
inappropriate way.
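(For context, bms_add_member may repalloc the set and returns the
new pointer, so the classic mistake with it is discarding the
return value. A general illustration, not necessarily the actual
call site in the patch:

    Bitmapset  *set = NULL;

    /* WRONG: result discarded; the set may have been repalloc'ed. */
    bms_add_member(set, 10);

    /* RIGHT: always assign the result back. */
    set = bms_add_member(set, 10);
)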

> Mmm, I reproduced it quite easily. A silly bug.
> 
> Something bad is happening in the interaction between freeing the
> ExecutorState memory context and the resource owner. Perhaps the
> ExecutorState is freed by the resowner (as part of one of its
> ancestors) before the memory for the WaitEventSet is freed. It was
> careless of me. I'll reconsider it.

The cause was that the WaitEventSet was placed in ExecutorState
but registered to TopTransactionResourceOwner. I fixed it.
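For clarity, the shape of the fix in a minimal sketch (not the
actual patch code; the three-argument CreateWaitEventSet() is the
signature this patch series introduces, and nevents is a
placeholder):

    /*
     * The bug: the WaitEventSet's memory lived in the per-query context
     * (ExecutorState) while its cleanup was registered with
     * TopTransactionResourceOwner, so transaction abort could touch
     * memory the executor had already freed.
     *
     * The fix keeps the two lifetimes aligned: allocate the set in the
     * per-query context and register it with the matching resource owner.
     */
    WaitEventSet *set = CreateWaitEventSet(estate->es_query_cxt,
                                           CurrentResourceOwner,
                                           nevents);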

These fixes are made on top of the previous patches for now. In
the attached files, 0008 and 0009 are for the second bug, 0012 is
for the first bug, and 0013 is for the bms bug.

Sorry for the confusing patch set; I will resend neater ones
soon.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hello, I totally reorganized the patch set into four patches on
the current master (9e43e87).

At Wed, 22 Feb 2017 17:39:45 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170222.173945.262776579.horiguchi.kyotaro@lab.ntt.co.jp>
> Finally, I couldn't reproduce the crash for the (maybe) same
> case. I can guess at two possible reasons. One is a situation
> where node->as_nasyncpending differs from
> estate->es_num_pending_async, but I couldn't find a way for that
> to happen. The other is a situation in postgresIterateForeignScan
> where the "next owner" reaches EOF but another waiter has not. I
> haven't reproduced that situation, but I fixed the code for the
> case anyway. In addition, I found a bug in
> ExecAsyncAppendResponse: it called bms_add_member in an
> inappropriate way.

This turned out to be wrong. The true problem here was (maybe)
that ExecAsyncRequest can complete a tuple immediately, which
causes ExecAsyncRequest to be called multiple times for the same
child at once. (In that case, the node being processed is added
back to node->as_needrequest before ExecAsyncRequest returns.)

Using a copy of node->as_needrequest would fix this, but that
makes me uneasy, so I changed ExecAsyncRequest not to return a
tuple immediately. Instead, ExecAsyncEventLoop skips waiting when
no node needs to wait, and a tuple that was previously delivered
as a "response" from within ExecAsyncRequest is now delivered
there.
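As a rough illustration of that control flow (a simplified sketch
only, not the patch's code; dispatch_ready_responses() is a
hypothetical stand-in for the response-callback machinery):

    for (;;)
    {
        /* Deliver any responses that are already complete. */
        if (dispatch_ready_responses(estate))
            break;              /* a tuple has reached the requestor */

        /* Nothing ready and nothing in flight: we are done. */
        if (estate->es_num_pending_async == 0)
            break;

        /* Otherwise sleep until some remote side produces data. */
        ExecAsyncEventWait(estate, -1);
    }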

In addition, the current policy of preserving es_wait_event_set
doesn't seem to work with the async-capable postgres_fdw, so the
current code clears it on every entry to ExecAppend. This needs
more thought.


I measured the performance of async execution, and it has improved
considerably over the previous version, especially in the
single-connection environment.

pf0: 4 foreign tables on a single connection
  non-async : (prev) 7928ms -> (this time) 7993ms
      async : (prev) 6447ms -> (this time) 3211ms

pf1: 4 foreign tables, one dedicated connection per table
  non-async : (prev) 8023ms -> (this time) 7953ms
      async : (prev) 1876ms -> (this time) 1841ms

The speedup from async execution is 60% (1 - 3211/7993) for the
single-connection environment and 77% (1 - 1841/7953) for the
dedicated-connection environment.


> > Mmm, I reproduced it quite easily. A silly bug.
> > 
> > Something bad is happening in the interaction between freeing the
> > ExecutorState memory context and the resource owner. Perhaps the
> > ExecutorState is freed by the resowner (as part of one of its
> > ancestors) before the memory for the WaitEventSet is freed. It was
> > careless of me. I'll reconsider it.
> 
> The cause was that the WaitEventSet was placed in ExecutorState
> but registered to TopTransactionResourceOwner. I fixed it.

The attached patches are as follows.

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch
  Allows a WaitEventSet to be released by its resource owner.

0002-Asynchronous-execution-framework.patch
  The asynchronous execution framework, based on Robert's version.
  All edits so far are merged into this.

0003-Make-postgres_fdw-async-capable.patch
  Makes postgres_fdw async-capable.

0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch
  This could be merged into 0002, but I left it separate since the
  use of these pragmas is arguable.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] asynchronous execution

From
Corey Huinker
Date:

On Thu, Feb 23, 2017 at 6:59 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
9e43e87

Patch fails on current master, but correctly applies to 9e43e87. Thanks for including the commit id.

Regression tests pass.

As with my last attempt at reviewing this patch, I'm confused about what kind of queries can take advantage of this patch. Is it only cases where a local table has multiple inherited foreign table children? Will it work with queries where two foreign tables are referenced and combined with a UNION ALL?

Re: [HACKERS] asynchronous execution

From
Amit Langote
Date:
On 2017/03/11 8:19, Corey Huinker wrote:
> 
> On Thu, Feb 23, 2017 at 6:59 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp <mailto:horiguchi.kyotaro@lab.ntt.co.jp>>
> wrote:
> 
>     9e43e87
> 
> 
> Patch fails on current master, but correctly applies to 9e43e87. Thanks
> for including the commit id.
> 
> Regression tests pass.
> 
> As with my last attempt at reviewing this patch, I'm confused about what
> kind of queries can take advantage of this patch. Is it only cases where a
> local table has multiple inherited foreign table children?

IIUC, Horiguchi-san's patch adds asynchronous capability for ForeignScans
driven by postgres_fdw (after building some relevant infrastructure
first).  The same might be added to other Scan nodes (and probably other
nodes as well) eventually so that more queries will benefit from
asynchronous execution.  It may just be that ForeignScans benefit more
from asynchronous execution than other Scan types.

> Will it work
> with queries where two foreign tables are referenced and combined with a
> UNION ALL?

I think it will, because Append itself has been made async-capable by one
of the patches and UNION ALL uses Append.  But as mentioned above, only
the postgres_fdw foreign tables will be able to utilize this for now.

Thanks,
Amit





Re: [HACKERS] asynchronous execution

From
Corey Huinker
Date:

I think it will, because Append itself has been made async-capable by one
of the patches and UNION ALL uses Append.  But as mentioned above, only
the postgres_fdw foreign tables will be able to utilize this for now.


Ok, I'll re-run my test from a few weeks back and see if anything has changed. 



Re: [HACKERS] asynchronous execution

From
Corey Huinker
Date:
On Mon, Mar 13, 2017 at 1:06 AM, Corey Huinker <corey.huinker@gmail.com> wrote:

I think it will, because Append itself has been made async-capable by one
of the patches and UNION ALL uses Append.  But as mentioned above, only
the postgres_fdw foreign tables will be able to utilize this for now.


Ok, I'll re-run my test from a few weeks back and see if anything has changed. 


I'm not able to discern any difference in plan between a 9.6 instance and this patch.

The basic outline of my test is:

EXPLAIN ANALYZE
SELECT c1, c2, ..., cN FROM tab1 WHERE date = '1 day ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab2 WHERE date = '2 days ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab3 WHERE date = '3 days ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab4 WHERE date = '4 days ago'

I've tried this test where tab1 through tab4 are all the same postgres_fdw foreign table.
I've tried this test where tab1 through tab4 are all different foreign tables pointing to the same remote table and sharing the same server definition.
I've tried this test where tab1 through tab4 are all different foreign tables, each with its own foreign server definition, all of which happen to point to the same remote cluster.

Are there some postgresql.conf settings I should set to get a decent test?



 

Re: [HACKERS] asynchronous execution

From
Amit Langote
Date:
On 2017/03/14 6:31, Corey Huinker wrote:
> On Mon, Mar 13, 2017 at 1:06 AM, Corey Huinker <corey.huinker@gmail.com>
> wrote:
> 
>>
>>> I think it will, because Append itself has been made async-capable by one
>>> of the patches and UNION ALL uses Append.  But as mentioned above, only
>>> the postgres_fdw foreign tables will be able to utilize this for now.
>>>
>>>
>> Ok, I'll re-run my test from a few weeks back and see if anything has
>> changed.
>>
> 
> 
> I'm not able to discern any difference in plan between a 9.6 instance and
> this patch.
> 
> The basic outline of my test is:
> 
> EXPLAIN ANALYZE
> SELECT c1, c2, ..., cN FROM tab1 WHERE date = '1 day ago'
> UNION ALL
> SELECT c1, c2, ..., cN FROM tab2 WHERE date = '2 days ago'
> UNION ALL
> SELECT c1, c2, ..., cN FROM tab3 WHERE date = '3 days ago'
> UNION ALL
> SELECT c1, c2, ..., cN FROM tab4 WHERE date = '4 days ago'
> 
> 
> I've tried this test where tab1 through tab4 are all the same postgres_fdw
> foreign table.
> I've tried this test where tab1 through tab4 are all different foreign
> tables pointing to the same remote table and sharing the same server
> definition.
> I've tried this test where tab1 through tab4 are all different foreign
> tables, each with its own foreign server definition, all of which
> happen to point to the same remote cluster.
> 
> Are there some postgresql.conf settings I should set to get a decent test?

I don't think the plan itself will change as a result of applying this
patch. You might however be able to observe some performance improvement.

Thanks,
Amit





Re: [HACKERS] asynchronous execution

From
Corey Huinker
Date:
I don't think the plan itself will change as a result of applying this
patch. You might however be able to observe some performance improvement.

Thanks,
Amit

I could see no performance improvement, even with 16 separate queries combined with UNION ALL. Query performance was always within +/- 10% of a 9.6 instance given the same script. I must be missing something.
 

Re: [HACKERS] asynchronous execution

From
Amit Langote
Date:
On 2017/03/14 10:08, Corey Huinker wrote:
>> I don't think the plan itself will change as a result of applying this
>> patch. You might however be able to observe some performance improvement.
> 
> I could see no performance improvement, even with 16 separate queries
> combined with UNION ALL. Query performance was always within +/- 10% of a 9.6
> instance given the same script. I must be missing something.

Hmm, maybe I'm missing something too.

Anyway, here is an older message on this thread from Horiguchi-san where
he shared some of the test cases that this patch improves performance for:

https://www.postgresql.org/message-id/20161018.103051.30820907.horiguchi.kyotaro%40lab.ntt.co.jp

From that message:

<quote>
I measured performance and had the following result.

t0  - SELECT sum(a) FROM <local single table>;
pl  - SELECT sum(a) FROM <4 local children>;
pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;

The result is written as "time<ms> (std dev <ms>)"

sync
  t0: 3820.33 (  1.88)
  pl: 1608.59 ( 12.06)
 pf0: 7928.29 ( 46.58)
 pf1: 8023.16 ( 26.43)

async
  t0: 3806.31 (  4.49)    0.4% faster (should be error)
  pl: 1629.17 (  0.29)    1.3% slower
 pf0: 6447.07 ( 25.19)   18.7% faster
 pf1: 1876.80 ( 47.13)   76.6% faster
</quote>

IIUC, pf0 and pf1 are the same test case (all 4 foreign tables target the
same server) measured with different implementations of the patch.

Thanks,
Amit





Re: [HACKERS] asynchronous execution

From
Corey Huinker
Date:
On Mon, Mar 13, 2017 at 9:28 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/03/14 10:08, Corey Huinker wrote:
>> I don't think the plan itself will change as a result of applying this
>> patch. You might however be able to observe some performance improvement.
>
> I could see no performance improvement, even with 16 separate queries
> combined with UNION ALL. Query performance was always within +/- 10% of a 9.6
> instance given the same script. I must be missing something.

Hmm, maybe I'm missing something too.

Anyway, here is an older message on this thread from Horiguchi-san where
he shared some of the test cases that this patch improves performance for:

https://www.postgresql.org/message-id/20161018.103051.30820907.horiguchi.kyotaro%40lab.ntt.co.jp

From that message:

<quote>
I measured performance and had the following result.

t0  - SELECT sum(a) FROM <local single table>;
pl  - SELECT sum(a) FROM <4 local children>;
pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;

The result is written as "time<ms> (std dev <ms>)"

sync
  t0: 3820.33 (  1.88)
  pl: 1608.59 ( 12.06)
 pf0: 7928.29 ( 46.58)
 pf1: 8023.16 ( 26.43)

async
  t0: 3806.31 (  4.49)    0.4% faster (should be error)
  pl: 1629.17 (  0.29)    1.3% slower
 pf0: 6447.07 ( 25.19)   18.7% faster
 pf1: 1876.80 ( 47.13)   76.6% faster
</quote>

IIUC, pf0 and pf1 are the same test case (all 4 foreign tables target the
same server) measured with different implementations of the patch.

Thanks,
Amit

I reworked the test such that all of the foreign tables inherit from the same parent table, and if you query that you do get async execution. But it doesn't work when just stringing together those foreign tables with UNION ALLs.

I don't know how to proceed with this review if that was a goal of the patch.

Re: [HACKERS] asynchronous execution

From
Tom Lane
Date:
Corey Huinker <corey.huinker@gmail.com> writes:
> I reworked the test such that all of the foreign tables inherit from the
> same parent table, and if you query that you do get async execution. But it
> doesn't work when just stringing together those foreign tables with UNION
> ALLs.

> I don't know how to proceed with this review if that was a goal of the
> patch.

Whether it was a goal or not, I'd say there is something either broken
or incorrectly implemented if you don't see that.  The planner (and
therefore also the executor) generally treats inheritance the same as
simple UNION ALL.  If that's not the case here, I'd want to know why.
        regards, tom lane



Re: [HACKERS] asynchronous execution

From
Corey Huinker
Date:
On Thu, Mar 16, 2017 at 4:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Corey Huinker <corey.huinker@gmail.com> writes:
> I reworked the test such that all of the foreign tables inherit from the
> same parent table, and if you query that you do get async execution. But it
> doesn't work when just stringing together those foreign tables with UNION
> ALLs.

> I don't know how to proceed with this review if that was a goal of the
> patch.

Whether it was a goal or not, I'd say there is something either broken
or incorrectly implemented if you don't see that.  The planner (and
therefore also the executor) generally treats inheritance the same as
simple UNION ALL.  If that's not the case here, I'd want to know why.

                        regards, tom lane

Updated commitfest entry to "Returned With Feedback".




Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
At Thu, 16 Mar 2017 17:16:32 -0400, Corey Huinker <corey.huinker@gmail.com> wrote in
<CADkLM=cBZEX9L9HnhJYrtfiAN5Ebdu=xbvM_poWVGBR7yN3gVw@mail.gmail.com>
> On Thu, Mar 16, 2017 at 4:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
> > Corey Huinker <corey.huinker@gmail.com> writes:
> > > I reworked the test such that all of the foreign tables inherit from the
> > > same parent table, and if you query that you do get async execution. But
> > > it
> > > doesn't work when just stringing together those foreign tables with UNION
> > > ALLs.
> >
> > > I don't know how to proceed with this review if that was a goal of the
> > > patch.
> >
> > Whether it was a goal or not, I'd say there is something either broken
> > or incorrectly implemented if you don't see that.  The planner (and
> > therefore also the executor) generally treats inheritance the same as
> > simple UNION ALL.  If that's not the case here, I'd want to know why.
> >
> >                         regards, tom lane
> >
> 
> Updated commitfest entry to "Returned With Feedback".

Sorry for the absence. For your information, I'll write some more
below.

At Tue, 14 Mar 2017 10:28:36 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<e7dc8128-f32b-ff9a-870e-f1117b8e4fa6@lab.ntt.co.jp>
> async
>   t0: 3806.31 (  4.49)    0.4% faster (should be error)
>   pl: 1629.17 (  0.29)    1.3% slower
>  pf0: 6447.07 ( 25.19)   18.7% faster
>  pf1: 1876.80 ( 47.13)   76.6% faster
> </quote>
> 
> IIUC, pf0 and pf1 are the same test case (all 4 foreign tables target the
> same server) measured with different implementations of the patch.

pf0 is measured on partitioned (sharded) tables on one foreign
server, that is, sharing a connection. pf1, in contrast, is
sharded tables each with a dedicated server (or connection). The
parent server is async-patched and the child server is not
patched.

Async-capable plans are generated in the planner. An Append that
contains at least one async-capable child becomes an async-aware
Append, so the async feature should also be effective for the
UNION ALL case.

The following will run faster than on an unpatched version.

SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20
  UNION ALL SELECT a FROM ft30 UNION ALL SELECT a FROM ft40) as ft;

I'll measure the performance for the case next week.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hello. This is the final report in this CF period.

At Fri, 17 Mar 2017 17:35:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170317.173505.152063931.horiguchi.kyotaro@lab.ntt.co.jp>
> Async-capable plans are generated in the planner. An Append that
> contains at least one async-capable child becomes an async-aware
> Append, so the async feature should also be effective for the
> UNION ALL case.
> 
> The following will run faster than on an unpatched version.
> 
> SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20
>   UNION ALL SELECT a FROM ft30 UNION ALL SELECT a FROM ft40) as ft;
> 
> I'll measure the performance for the case next week.

I found that the following query works the same way as the
partitioned-table case.

SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20
  UNION ALL SELECT a FROM ft30 UNION ALL SELECT a FROM ft40
  UNION ALL SELECT a FROM ONLY pf0) as ft;

So the difference comes from the additional async-incapable child
(the query is faster if it contains one). In both cases the Append
node runs its children asynchronously, but behaves slightly
differently when all async-capable children are busy.

I'll continue working on this from this point, aiming at the next
commit fest.

Thank you for the valuable feedback.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: asynchronous execution

From
Corey Huinker
Date:

I'll continue working on this from this point, aiming at the next
commit fest.


This probably will not surprise you given the many commits in the past 2 weeks, but the patches no longer apply to master:

 $ git apply ~/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:27: trailing whitespace.
        FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:39: trailing whitespace.
#include "utils/resowner_private.h"
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:47: trailing whitespace.
        ResourceOwner   resowner;       /* Resource owner */
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:48: trailing whitespace.

/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:57: trailing whitespace.
        WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
error: patch failed: src/backend/libpq/pqcomm.c:201
error: src/backend/libpq/pqcomm.c: patch does not apply
error: patch failed: src/backend/storage/ipc/latch.c:61
error: src/backend/storage/ipc/latch.c: patch does not apply
error: patch failed: src/backend/storage/lmgr/condition_variable.c:66
error: src/backend/storage/lmgr/condition_variable.c: patch does not apply
error: patch failed: src/backend/utils/resowner/resowner.c:124
error: src/backend/utils/resowner/resowner.c: patch does not apply
error: patch failed: src/include/storage/latch.h:101
error: src/include/storage/latch.h: patch does not apply
error: patch failed: src/include/utils/resowner_private.h:18
error: src/include/utils/resowner_private.h: patch does not apply

Re: asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hello,

At Sun, 2 Apr 2017 12:21:14 -0400, Corey Huinker <corey.huinker@gmail.com> wrote in
<CADkLM=dN_vt8kazOoiVOfjN6xFHpzf5uiGJz+iN+f4fLbYwSKA@mail.gmail.com>
> >
> >
> > I'll continue working on this from this point, aiming at the next
> > commit fest.
> >
> >
> This probably will not surprise you given the many commits in the past 2
> weeks, but the patches no longer apply to master:

Yeah, I'm not surprised by that, but thank you for letting me
know; it greatly reduces the difficulty of merging.

>  $ git apply
> ~/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch
> /home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:27:
> trailing whitespace.

Maybe the patch was retrieved on Windows and then transferred to
the Linux box. Converting the files' EOLs, or some git
configuration, might fix that. (git am has --no-keep-cr, but I
haven't found an equivalent for git apply.)

The attached patch is rebased on the current master, with no
substantial changes other than disallowing async on partitioned
tables via an assertion.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hello.

At Tue, 04 Apr 2017 19:25:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170404.192539.29699823.horiguchi.kyotaro@lab.ntt.co.jp>
> The attached patch is rebased on the current master, with no
> substantial changes other than disallowing async on partitioned
> tables via an assertion.

This is just rebased onto the current master (d761fe2).
I'll recheck the details after this.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
At Mon, 22 May 2017 13:12:14 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170522.131214.20936668.horiguchi.kyotaro@lab.ntt.co.jp>
> > The attached patch is rebased on the current master, with no
> > substantial changes other than disallowing async on partitioned
> > tables via an assertion.
> 
> This is just rebased onto the current master (d761fe2).
> I'll recheck the details after this.

Sorry, the patch was missing some files that should have been added.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
The patch had conflicts. This is a new version, just rebased onto
the current master. Further amendments will come later.

> The attached patch is rebased on the current master, with no
> substantial changes other than disallowing async on partitioned
> tables via an assertion.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] asynchronous execution

From
Antonin Houska
Date:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> The patch had conflicts. This is a new version, just rebased onto
> the current master. Further amendments will come later.

Can you please explain this part of make_append() ?

/* Currently async on partitioned tables is not available */
Assert(nasyncplans == 0 || partitioned_rels == NIL);

I don't think the output of Append plan is supposed to be ordered even if the
underlying relation is partitioned. Besides ordering, is there any other
reason not to use the asynchronous execution?

And even if there were some, the planner should ensure that the executor does
not hit the assertion above. The attached script shows an example of how to
cause the assertion failure.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Attachment

Re: [HACKERS] asynchronous execution

From
Antonin Houska
Date:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> The patch had conflicts. This is a new version, just rebased onto
> the current master. Further amendments will come later.

Just one idea that I had while reading the code.

In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the
completed requests to the end and finally adjust estate->es_num_pending_async so
that the array no longer contains the completed requests. I think the point is
that you can then add new requests to the end of the array.

I wonder if a set (Bitmapset) of incomplete requests would make the code more
efficient. The set would contain the position of each incomplete request in
estate->es_pending_async (I think it's the myindex field of
PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
requests subject to ExecAsyncNotify etc., then the compaction of
estate->es_pending_async wouldn't be necessary.

ExecAsyncRequest would use the set to look for space for new requests by
iterating it and trying to find the first gap (which corresponds to a completed
request).

And finally, an item would be removed from the set at the moment the request's
state is set to ASYNCREQ_COMPLETE.
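A minimal sketch of that bookkeeping (this only illustrates the
idea; the bms_* calls are PostgreSQL's existing Bitmapset
primitives, and the surrounding code is invented for the example):

    Bitmapset  *incomplete;     /* positions of incomplete requests */
    int         pos;

    /* Event loop: visit only the incomplete requests; no compaction. */
    pos = -1;
    while ((pos = bms_next_member(incomplete, pos)) >= 0)
    {
        PendingAsyncRequest *areq = estate->es_pending_async[pos];

        /* ... ExecAsyncNotify() etc. on areq ... */

        if (areq->state == ASYNCREQ_COMPLETE)
            incomplete = bms_del_member(incomplete, pos);
    }

    /* New request: reuse the first gap left by a completed request. */
    for (pos = 0; pos < estate->es_num_pending_async; pos++)
        if (!bms_is_member(pos, incomplete))
            break;
    incomplete = bms_add_member(incomplete, pos);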

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Thank you for looking this.

At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska <ah@cybertec.at> wrote in <4579.1498638234@localhost>
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> 
> > The patch had conflicts. This is a new version, just rebased onto
> > the current master. Further amendments will come later.
> 
> Can you please explain this part of make_append() ?
> 
> /* Currently async on partitioned tables is not available */
> Assert(nasyncplans == 0 || partitioned_rels == NIL);
> 
> I don't think the output of Append plan is supposed to be ordered even if the
> underlying relation is partitioned. Besides ordering, is there any other
> reason not to use the asynchronous execution?

It was just a developmental sentinel to remind me to consider
declarative partitions later, since I didn't have a clear idea of
the differences (or similarities) between appendrels and
partitioned_rels. It is not meant to say that the condition cannot
occur. I'll check it out and support partitioned_rels soon. Sorry
for having left it as it is.

> And even if there were some, the planner should ensure that the executor does
> not hit the assertion above. The attached script shows an example of how to
> cause the assertion failure.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] asynchronous execution

From
Amit Langote
Date:
Hi,

On 2017/06/29 13:45, Kyotaro HORIGUCHI wrote:
> Thank you for looking this.
> 
> At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska wrote:
>> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>
>>> The patch had conflicts. This is a new version, just rebased onto
>>> the current master. Further amendments will come later.
>>
>> Can you please explain this part of make_append() ?
>>
>> /* Currently async on partitioned tables is not available */
>> Assert(nasyncplans == 0 || partitioned_rels == NIL);
>>
>> I don't think the output of Append plan is supposed to be ordered even if the
>> underlying relation is partitioned. Besides ordering, is there any other
>> reason not to use the asynchronous execution?
> 
> It was just a developmental sentinel to remind me to consider
> declarative partitions later, since I didn't have a clear idea of
> the differences (or similarities) between appendrels and
> partitioned_rels. It is not meant to say that the condition cannot
> occur. I'll check it out and support partitioned_rels soon. Sorry
> for having left it as it is.

When making an Append for a partitioned table, among the arguments passed
to make_append(), 'partitioned_rels' is a list of RT indexes of
partitioned tables in the inheritance tree of which the aforementioned
partitioned table is the root.  'appendplans' is a list of subplans for
scanning the leaf partitions in the tree.  Note that the 'appendplans'
list contains no members corresponding to the partitioned tables, because
we don't need to scan them (only leaf relations contain any data).

The point of having the 'partitioned_rels' list in the resulting Append
plan is so that the executor can identify those relations and take the
appropriate locks on them.

Thanks,
Amit




Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hi, I've returned.

At Thu, 29 Jun 2017 14:08:27 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<63a5a01c-2967-83e0-8bbf-c981404f529e@lab.ntt.co.jp>
> Hi,
> 
> On 2017/06/29 13:45, Kyotaro HORIGUCHI wrote:
> > Thank you for looking this.
> > 
> > At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska wrote:
> >> Can you please explain this part of make_append() ?
> >>
> >> /* Currently async on partitioned tables is not available */
> >> Assert(nasyncplans == 0 || partitioned_rels == NIL);
> >>
> >> I don't think the output of Append plan is supposed to be ordered even if the
> >> underlying relation is partitioned. Besides ordering, is there any other
> >> reason not to use the asynchronous execution?
> 
> When making an Append for a partitioned table, among the arguments passed
> to make_append(), 'partitioned_rels' is a list of RT indexes of
> partitioned tables in the inheritance tree of which the aforementioned
> partitioned table is the root.  'appendplans' is a list of subplans for
> scanning the leaf partitions in the tree.  Note that the 'appendplans'
> list contains no members corresponding to the partitioned tables, because
> we don't need to scan them (only leaf relations contain any data).
> 
> The point of having the 'partitioned_rels' list in the resulting Append
> plan is so that the executor can identify those relations and take the
> appropriate locks on them.

Amit, thank you for the detailed explanation. I now understand
what it is and that just ignoring it is enough, and I confirmed
that it actually works as before.

I'll address Antonin's comments tomorrow.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Thank you for the thought.

This is at the PoC level, so I'd be grateful for this kind of
fundamental comment.

At Wed, 28 Jun 2017 20:22:24 +0200, Antonin Houska <ah@cybertec.at> wrote in <392.1498674144@localhost>
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> 
> > The patch had conflicts. This is a new version, just rebased onto
> > the current master. Further amendments will come later.
> 
> Just one idea that I had while reading the code.
> 
> In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the
> completed requests to the end and finally adjust estate->es_num_pending_async so
> that the array no longer contains the completed requests. I think the point is
> that you can then add new requests to the end of the array.
> 
> I wonder if a set (Bitmapset) of incomplete requests would make the code more
> efficient. The set would contain the position of each incomplete request in
> estate->es_pending_async (I think it's the myindex field of
> PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
> requests subject to ExecAsyncNotify etc., then the compaction of
> estate->es_pending_async wouldn't be necessary.
> 
> ExecAsyncRequest would use the set to look for space for new requests by
> iterating it and trying to find the first gap (which corresponds to a completed
> request).
> 
> And finally, an item would be removed from the set at the moment the request's
> state is set to ASYNCREQ_COMPLETE.

Effectively it is a waiting queue followed by a completed list.
The point of the compaction is to keep the order of the waiting,
not-yet-completed requests, which is crucial to avoid a kind of
precedence inversion. We cannot keep the order by using a
bitmapset in that way.

The current code waits on all waiters at once and processes all
fired events at once, so the order in the waiting queue is
inessential in that case. On the other hand, I suppose that
waiting on several tens to nearly a hundred remote hosts is
within the realistic target range, and keeping the order could
become crucial if we process only part of the queue at a time.

If we put significance on the variation in the response times of
the remotes, process-all-at-once is effective; in turn, we should
consider the cost of the lifecycle of the larger wait event set.

Sorry for the discursive discussion, but in short, I have noticed
that I have a lot to consider here :p Thanks!

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] asynchronous execution

From
Antonin Houska
Date:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> > Just one idea that I had while reading the code.
> >
> > In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the
> > completed requests to the end and finally adjust estate->es_num_pending_async so
> > that the array no longer contains the completed requests. I think the point is
> > that you can then add new requests to the end of the array.
> >
> > I wonder if a set (Bitmapset) of incomplete requests would make the code more
> > efficient. The set would contain the position of each incomplete request in
> > estate->es_pending_async (I think it's the myindex field of
> > PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
> > requests subject to ExecAsyncNotify etc., then the compaction of
> > estate->es_pending_async wouldn't be necessary.
> >
> > ExecAsyncRequest would use the set to look for space for new requests by
> > iterating it and trying to find the first gap (which corresponds to a completed
> > request).
> >
> > And finally, an item would be removed from the set at the moment the request's
> > state is set to ASYNCREQ_COMPLETE.
>
> Effectively it is a waiting queue followed by a completed list.
> The point of the compaction is to keep the order of the waiting,
> not-yet-completed requests, which is crucial to avoid a kind of
> precedence inversion. We cannot keep the order by using a
> bitmapset in that way.
>
> The current code waits on all waiters at once and processes all
> fired events at once, so the order in the waiting queue is
> inessential in that case. On the other hand, I suppose that
> waiting on several tens to nearly a hundred remote hosts is
> within the realistic target range, and keeping the order could
> become crucial if we process only part of the queue at a time.
>
> If we put significance on the variation in the response times of
> the remotes, process-all-at-once is effective; in turn, we should
> consider the cost of the lifecycle of the larger wait event set.

ok, I missed the fact that the order of es_pending_async entries is
important. I think this is worth a comment in the code.

Actually, the reason I thought of the simplification was that I noticed a small
inefficiency in the way you do the compaction. In particular, I think it's not
always necessary to swap the tail and head entries. Would something like this
make sense?

        /* If any node completed, compact the array. */
        if (any_node_done)
        {
            int        hidx = 0,
                       tidx;

            /*
             * Swap all not-yet-completed items to the start of the array.
             * Keep them in the same order.
             */
            for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
            {
                PendingAsyncRequest *tail = estate->es_pending_async[tidx];

                Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);

                if (tail->state == ASYNCREQ_COMPLETE)
                    continue;

                /*
                 * If the array starts with one or more incomplete requests,
                 * both head and tail point at the same item, so there's no
                 * point in swapping.
                 */
                if (tidx > hidx)
                {
                    PendingAsyncRequest *head = estate->es_pending_async[hidx];

                    /*
                     * Once the tail got ahead, it should only leave
                     * ASYNCREQ_COMPLETE behind.  Only those can then be seen
                     * by head.
                     */
                    Assert(head->state == ASYNCREQ_COMPLETE);

                    estate->es_pending_async[tidx] = head;
                    estate->es_pending_async[hidx] = tail;
                }

                ++hidx;
            }

            estate->es_num_pending_async = hidx;
        }

And besides that, I think it'd be more intuitive if the meaning of "head" and
"tail" was reversed: if the array is iterated from lower to higher positions,
then I'd consider head to be at higher position, not tail.

--
Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener
Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hello,

At Tue, 11 Jul 2017 10:28:51 +0200, Antonin Houska <ah@cybertec.at> wrote in <6448.1499761731@localhost>
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Effectively it is a waiting queue followed by a completed list.
> > The point of the compaction is to keep the order of the waiting,
> > not-yet-completed requests, which is crucial to avoid a kind of
> > precedence inversion. We cannot keep the order by using a
> > bitmapset in that way.
> 
> > The current code waits on all waiters at once and processes all
> > fired events at once, so the order in the waiting queue is
> > inessential in that case. On the other hand, I suppose that
> > waiting on several tens to nearly a hundred remote hosts is
> > within the realistic target range, and keeping the order could
> > become crucial if we process only part of the queue at a time.
> > 
> > If we put significance on the variation in the response times of
> > the remotes, process-all-at-once is effective; in turn, we should
> > consider the cost of the lifecycle of the larger wait event set.
> 
> ok, I missed the fact that the order of es_pending_async entries is
> important. I think this is worth a comment in the code.

I'll put an upper limit on the number of waiters processed at
once, and then add a comment along those lines.

> Actually, the reason I thought of the simplification was that I noticed a small
> inefficiency in the way you do the compaction. In particular, I think it's not
> always necessary to swap the tail and head entries. Would something like this
> make sense?

I'm not sure, but I suppose it is rare for many elements at the
start of the array to be incomplete; in most cases the first
element gets a response first.

> 
>         /* If any node completed, compact the array. */
>         if (any_node_done)
>         {
...
>             for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
>             {
...
>                 if (tail->state == ASYNCREQ_COMPLETE)
>                     continue;
> 
>                 /*
>                  * If the array starts with one or more incomplete requests,
>                  * both head and tail point at the same item, so there's no
>                  * point in swapping.
>                  */
>                 if (tidx > hidx)
>                 {

This skips the first several elements when all of them are
ASYNCREQ_COMPLETE. I think it makes sense as long as it doesn't
hurt the loop. The optimization is more effective when hoisted
out of the loop, like this:

|  for (tidx = 0; tidx < estate->es_num_pending_async &&
|                 estate->es_pending_async[tidx]->state == ASYNCREQ_COMPLETE;
|       ++tidx)
|      ;
|
|  for (; tidx < estate->es_num_pending_async; ++tidx)
...


> And besides that, I think it'd be more intuitive if the meaning of "head" and
> "tail" was reversed: if the array is iterated from lower to higher positions,
> then I'd consider head to be at higher position, not tail.

Yeah, but maybe the "head" is still confusing even if reversed
because it is still not a head of something.  It might be less
confusing by rewriting it in more verbose-but-straightforwad way.


|      int npending = 0;
| 
|      /* Skip over not-completed items at the beginning */
|      while (npending < estate->es_num_pending_async &&
|             estate->es_pending_async[npending]->state != ASYNCREQ_COMPLETE)
|        npending++;
| 
|      /* Scan over the rest for not-completed items */
|      for (i = npending + 1 ; i < estate->es_num_pending_async; ++i)
|      {
|        PendingAsyncRequest *tmp;
|        PendingAsyncRequest *curr = estate->es_pending_async[i];
|
|        if (curr->state == ASYNCREQ_COMPLETE)
|          continue;
|
|        /* Swap the not-completed item into the tail of the first chunk */
|        tmp = estate->es_pending_async[npending];
|        estate->es_pending_async[npending] = curr;
|        estate->es_pending_async[i] = tmp;
|        ++npending;
|      }


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hello,

8bf58c0d9bd33686 conflicts badly with this patch, so I rebased it
and added a patch to refactor the function that Antonin pointed
out. That patch would eventually be merged into 0002.

At Tue, 18 Jul 2017 16:24:52 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170718.162452.221576658.horiguchi.kyotaro@lab.ntt.co.jp>
> I'll put an upper limit on the number of waiters processed at
> once, and then add a comment along those lines.
> 
> > Actually, the reason I thought of the simplification was that I noticed a small
> > inefficiency in the way you do the compaction. In particular, I think it's not
> > always necessary to swap the tail and head entries. Would something like this
> > make sense?
> 
> I'm not sure, but I suppose it is rare for many elements at the
> start of the array to be incomplete; in most cases the first
> element gets a response first.
...
> Yeah, but maybe the "head" is still confusing even if reversed
> because it is still not a head of something.  It might be less
> confusing by rewriting it in more verbose-but-straightforwad way.
> 
> 
> |      int npending = 0;
> | 
> |      /* Skip over not-completed items at the beginning */
> |      while (npending < estate->es_num_pending_async &&
> |             estate->es_pending_async[npending]->state != ASYNCREQ_COMPLETE)
> |        npending++;
> | 
> |      /* Scan over the rest for not-completed items */
> |      for (i = npending + 1 ; i < estate->es_num_pending_async; ++i)
> |      {
> |        PendingAsyncRequest *tmp;
> |        PendingAsyncRequest *curr = estate->es_pending_async[i];
> |
> |        if (curr->state == ASYNCREQ_COMPLETE)
> |          continue;
> |
> |        /* Swap the not-completed item into the tail of the first chunk */
> |        tmp = estate->es_pending_async[npending];
> |        estate->es_pending_async[npending] = curr;
> |        estate->es_pending_async[i] = tmp;
> |        ++npending;
> |      }

The last patch does something like this (with the apparent bugs
fixed).

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] asynchronous execution

From
Robert Haas
Date:
On Tue, Jul 25, 2017 at 5:11 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>  [ new patches ]

I spent some time today refreshing my memory of what's going on
with this thread.

Ostensibly, the advantage of this framework over my previous proposal
is that it avoids inserting anything into ExecProcNode(), which is
probably a good thing to avoid given how frequently ExecProcNode() is
called.  Unless the parent and the child both know about asynchronous
execution and choose to use it, everything runs exactly as it does
today and so there is no possibility of a complaint about a
performance hit.  As far as it goes, that is good.

However, at a deeper level, I fear we haven't really solved the
problem.  If an Append is directly on top of a ForeignScan node, then
this will work.  But if an Append is indirectly on top of a
ForeignScan node with some other stuff in the middle, then it won't -
unless we make whichever nodes appear between the Append and the
ForeignScan async-capable.  Indeed, we'd really want all kinds of
joins and aggregates to be async-capable so that examples like the one
Corey asked about in
http://postgr.es/m/CADkLM=fuvVdKvz92XpCRnb4=rj6bLOhSLifQ3RV=Sb4Q5rJsRA@mail.gmail.com
will work.

But if we do, then I fear we'll just be reintroducing the same
performance regression that we introduced by switching to this
framework from the previous one - or maybe a different one, but a
regression all the same.  Every type of intermediate node will have to
have a code path where it uses ExecAsyncRequest() /
ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and
it seems like that will either end up duplicating a lot of code from
the regular code path or, alternatively, polluting the regular code
path with some of the async code's concerns to avoid duplication, and
maybe slowing things down.

Maybe that concern is unjustified; I'm not sure.  Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] asynchronous execution

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Ostensibly, the advantage of this framework over my previous proposal
> is that it avoids inserting anything into ExecProcNode(), which is
> probably a good thing to avoid given how frequently ExecProcNode() is
> called.  Unless the parent and the child both know about asynchronous
> execution and choose to use it, everything runs exactly as it does
> today and so there is no possibility of a complaint about a
> performance hit.  As far as it goes, that is good.

> However, at a deeper level, I fear we haven't really solved the
> problem.  If an Append is directly on top of a ForeignScan node, then
> this will work.  But if an Append is indirectly on top of a
> ForeignScan node with some other stuff in the middle, then it won't -
> unless we make whichever nodes appear between the Append and the
> ForeignScan async-capable.

I have not been paying any attention to this thread whatsoever,
but I wonder if you can address your problem by building on top of
the ExecProcNode replacement that Andres is working on,
https://www.postgresql.org/message-id/20170726012641.bmhfcp5ajpudihl6@alap3.anarazel.de

The scheme he has allows $extra_stuff to be injected into ExecProcNode at
no cost when $extra_stuff is not needed, because you simply don't insert
the wrapper function when it's not needed.  I'm not sure that it will
scale well to several different kinds of insertions though, for instance
if you wanted both instrumentation and async support on the same node.
But maybe those two couldn't be arms-length from each other anyway,
in which case it might be fine as-is.
        regards, tom lane



Re: [HACKERS] asynchronous execution

From
Robert Haas
Date:
On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I have not been paying any attention to this thread whatsoever,
> but I wonder if you can address your problem by building on top of
> the ExecProcNode replacement that Andres is working on,
> https://www.postgresql.org/message-id/20170726012641.bmhfcp5ajpudihl6@alap3.anarazel.de
>
> The scheme he has allows $extra_stuff to be injected into ExecProcNode at
> no cost when $extra_stuff is not needed, because you simply don't insert
> the wrapper function when it's not needed.  I'm not sure that it will
> scale well to several different kinds of insertions though, for instance
> if you wanted both instrumentation and async support on the same node.
> But maybe those two couldn't be arms-length from each other anyway,
> in which case it might be fine as-is.

Yeah, I don't quite see how that would apply in this case -- what we
need here is not as simple as just conditionally injecting an extra
bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Thank you for the comment.

At Wed, 26 Jul 2017 17:16:43 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoYrbgTBnLwnr1v=pk+C=znWg7AgV9=M9ehrq6TDexPQNw@mail.gmail.com>
> But if we do, then I fear we'll just be reintroducing the same
> performance regression that we introduced by switching to this
> framework from the previous one - or maybe a different one, but a
> regression all the same.  Every type of intermediate node will have to
> have a code path where it uses ExecAsyncRequest() /
> ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and

I understand Robert's concern and I think I share the same
opinion. It needs a further, different framework.

At Thu, 27 Jul 2017 06:39:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+Tgmoa=ke_zfucOAa3YEUnBSC=FSXn8SU2aYc8PGBBp=Yy9fw@mail.gmail.com>
> On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I have not been paying any attention to this thread whatsoever,
> > but I wonder if you can address your problem by building on top of
> > the ExecProcNode replacement that Andres is working on,
> > https://www.postgresql.org/message-id/20170726012641.bmhfcp5ajpudihl6@alap3.anarazel.de
> >
> > The scheme he has allows $extra_stuff to be injected into ExecProcNode at
> > no cost when $extra_stuff is not needed, because you simply don't insert
> > the wrapper function when it's not needed.  I'm not sure that it will
> > scale well to several different kinds of insertions though, for instance
> > if you wanted both instrumentation and async support on the same node.
> > But maybe those two couldn't be arms-length from each other anyway,
> > in which case it might be fine as-is.
> 
> Yeah, I don't quite see how that would apply in this case -- what we
> need here is not as simple as just conditionally injecting an extra
> bit.

Thank you for the pointer, Tom. The subject (segfault in HEAD...)
hadn't made me think that this kind of discussion was being held
there. Anyway, it seems very close to asynchronous execution, so
I'll catch up on it while considering how to tie it in with this
work.

Regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
At Fri, 28 Jul 2017 17:31:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170728.173105.238045591.horiguchi.kyotaro@lab.ntt.co.jp>
> Thank you for the comment.
> 
> At Wed, 26 Jul 2017 17:16:43 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoYrbgTBnLwnr1v=pk+C=znWg7AgV9=M9ehrq6TDexPQNw@mail.gmail.com>
> > regression all the same.  Every type of intermediate node will have to
> > have a code path where it uses ExecAsyncRequest() /
> > ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and
> 
> I understand Robert's concern and I share the same
> opinion. It needs a further, different framework.
> 
> At Thu, 27 Jul 2017 06:39:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+Tgmoa=ke_zfucOAa3YEUnBSC=FSXn8SU2aYc8PGBBp=Yy9fw@mail.gmail.com>
> > On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > The scheme he has allows $extra_stuff to be injected into ExecProcNode at
> > > no cost when $extra_stuff is not needed, because you simply don't insert
> > > the wrapper function when it's not needed.  I'm not sure that it will
...
> > Yeah, I don't quite see how that would apply in this case -- what we
> > need here is not as simple as just conditionally injecting an extra
> > bit.
> 
> Thank you for the pointer, Tom. The subject (segfault in HEAD...)
> didn't make me think that this kind of discussion was going on there.
> Anyway, it seems very close to asynchronous execution, so I'll catch
> up on it and consider how to tie this work in with it.

I have studied the executor change that has just been made on master,
based on the thread pointed to above. It seems to have the capability
to let an exec node switch to being async-aware at no extra cost to
non-async processing. So it would be doable to (just) *shrink* the
current framework by detaching the async-aware side of the API. But to
get the most out of asynchrony, multiple async-capable nodes
distributed under async-unaware nodes must be able to run
simultaneously.

There seem to be two ways to achieve this.

One is propagating a required-async-nodes bitmap up to the topmost
node and waiting for all the required nodes to become ready. In the
long run this requires every node to be async-aware, and that would
apparently hurt the performance of async-unaware nodes that contain
async-capable nodes.

Another is getting rid of the recursive calls used to run an execution
tree. This is perhaps the same as the "data-centric processing"
mentioned in previous threads [1], [2], but I'd like to pay attention
to the aspect of "enabling the execution tree to resume from an
arbitrary leaf node". So I'm considering realizing it still in a
one-tuple-at-a-time manner instead of collecting all the tuples of a
leaf node first, though I'm not sure it is doable.


[1] https://www.postgresql.org/message-id/BF2827DCCE55594C8D7A8F7FFD3AB77159A9B904@szxeml521-mbs.china.huawei.com
[2] https://www.postgresql.org/message-id/20160629183254.frcm3dgg54ud5m6o@alap3.anarazel.de
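
To illustrate the second approach, here is a minimal sketch under the
assumption that "resuming from a leaf" means each node exposes a push step
and a driver loop feeds tuples upward one at a time; all types and names
are invented:

#include <stdbool.h>
#include <stddef.h>

typedef struct PushNode PushNode;

struct PushNode
{
    PushNode   *parent;
    /* consume one tuple; return true if this node wants more input */
    bool        (*push) (PushNode *self, void *tuple);
};

/* Resume execution from an arbitrary leaf: hand tuples to the parent one
 * at a time, stopping as soon as the parent suspends, so another ready
 * leaf can be driven instead of recursing down from the root. */
static void
resume_from_leaf(PushNode *leaf, void *(*next_tuple) (PushNode *))
{
    void       *tuple;

    while ((tuple = next_tuple(leaf)) != NULL)
    {
        if (!leaf->parent->push(leaf->parent, tuple))
            break;
    }
}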

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] asynchronous execution

From
Robert Haas
Date:
On Mon, Jul 31, 2017 at 5:42 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Another is getting rid of recursive call to run an execution
> tree.

That happens to be exactly what Andres did for expression evaluation
in commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755, and I think
generalizing that to include the plan tree as well as expression trees
is likely to be the long-term way forward here.  Unfortunately, that's
probably another gigantic patch (that should probably be written by
Andres).
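
For readers who haven't studied that commit, its flavor in miniature is that
the recursive walk over a tree is compiled into a flat array of steps run by
a single dispatch loop; everything below is invented and vastly simplified:

typedef enum StepOp
{
    STEP_CONST,                 /* push a constant */
    STEP_ADD,                   /* pop two values, push their sum */
    STEP_DONE                   /* return the top of the stack */
} StepOp;

typedef struct ExprStep
{
    StepOp      op;
    int         value;          /* only used by STEP_CONST */
} ExprStep;

/* One flat loop replaces the recursive evaluator; (1 + 2) compiles to
 * { {STEP_CONST,1}, {STEP_CONST,2}, {STEP_ADD,0}, {STEP_DONE,0} }. */
static int
eval_steps(const ExprStep *steps)
{
    int         stack[16];
    int         sp = 0;

    for (int i = 0;; i++)
    {
        switch (steps[i].op)
        {
            case STEP_CONST:
                stack[sp++] = steps[i].value;
                break;
            case STEP_ADD:
                sp--;
                stack[sp - 1] += stack[sp];
                break;
            case STEP_DONE:
                return stack[0];
        }
    }
}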

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Thank you for the comment.

At Tue, 1 Aug 2017 16:27:41 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmobbZrBPb7cvFj3ACPX2A_qSEB4ughRmB5dkGPXUYx_E+Q@mail.gmail.com>
> On Mon, Jul 31, 2017 at 5:42 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Another is getting rid of recursive call to run an execution
> > tree.
> 
> That happens to be exactly what Andres did for expression evaluation
> in commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755, and I think
> generalizing that to include the plan tree as well as expression trees
> is likely to be the long-term way forward here.

I read it in the source tree. The patch converts an expression tree
into an intermediate form and then runs it on a custom-made
interpreter. Guessing from Andres's phrase "upside down", the whole
thing will become source-driven.

> Unfortunately, that's probably another gigantic patch (that
> should probably be written by Andres).

Yeah, but building an async executor on the current style of executor
seems futile work, and sitting idle until the patch comes is also a
waste of time. So I'm planning to include the following stuff in the
next PoC patch, even though I'm not sure it can land on Andres's
coming patch.

- Tuple passing outside the call stack (see the sketch after this list).
  (I remember this appeared earlier in the thread, but I couldn't find it.)
  This should be included in Andres's patch.

- Give the executor the ability to run from data-source (or driver) nodes up to the root.
  I'm not sure this is included, but I suppose he is aiming at this kind of thing.

- Rebuild asynchronous execution on the upside-down executor.
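
A minimal sketch of the first item, assuming it means a small ring buffer
between producer and consumer nodes instead of passing each tuple as a
return value up the call stack (all names invented):

#include <stdbool.h>
#include <stddef.h>

#define QUEUE_LEN 8

typedef struct TupleQueue
{
    void       *slots[QUEUE_LEN];
    int         head;           /* next slot to read */
    int         tail;           /* next slot to write */
} TupleQueue;

/* Producer side: returns false when full, i.e. the producer suspends and
 * the driver can run some other node. */
static bool
queue_push(TupleQueue *q, void *tuple)
{
    if ((q->tail + 1) % QUEUE_LEN == q->head)
        return false;
    q->slots[q->tail] = tuple;
    q->tail = (q->tail + 1) % QUEUE_LEN;
    return true;
}

/* Consumer side: returns NULL when empty, i.e. the consumer suspends
 * instead of recursing into its child. */
static void *
queue_pop(TupleQueue *q)
{
    void       *tuple;

    if (q->head == q->tail)
        return NULL;
    tuple = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_LEN;
    return tuple;
}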


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
At Thu, 03 Aug 2017 09:30:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170803.093057.261590619.horiguchi.kyotaro@lab.ntt.co.jp>
> > Unfortunately, that's probably another gigantic patch (that
> > should probably be written by Andres).
> 
> Yeah, but building an async executor on the current style of executor
> seems futile work, and sitting idle until the patch comes is also a
> waste of time. So I'm planning to include the following stuff in the
> next PoC patch, even though I'm not sure it can land on Andres's
> coming patch.
> 
> - Tuple passing outside the call stack. (I remember this appeared
>   earlier in the thread, but I couldn't find it.)
> 
>   This should be included in Andres's patch.
> 
> - Give the executor the ability to run from data-source (or driver)
>   nodes up to the root.
> 
>   I'm not sure this is included, but I suppose he is aiming at this
>   kind of thing.
> 
> - Rebuild asynchronous execution on the upside-down executor.

Anyway, I modified ExecProcNode into a push-up form and it *seems* to
work to some extent. But triggers and cursors are almost broken and
several other regression tests fail. Some nodes, such as windowagg,
are terribly difficult to change to this push-up form (using a state
machine). And of course it is terribly inefficient.

I'm afraid that all of this will turn out to be in vain. But anyway,
and FWIW, I'll post the work here after some cleanup.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
At Thu, 31 Aug 2017 21:52:36 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170831.215236.135328985.horiguchi.kyotaro@lab.ntt.co.jp>
> At Thu, 03 Aug 2017 09:30:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
> <20170803.093057.261590619.horiguchi.kyotaro@lab.ntt.co.jp>
 
> > > Unfortunately, that's probably another gigantic patch (that
> > > should probably be written by Andres).
> > 
> > Yeah, but building an async executor on the current style of executor
> > seems futile work, and sitting idle until the patch comes is also a
> > waste of time. So I'm planning to include the following stuff in the
> > next PoC patch, even though I'm not sure it can land on Andres's
> > coming patch.
> > 
> > - Tuple passing outside the call stack. (I remember this appeared
> >   earlier in the thread, but I couldn't find it.)
> > 
> >   This should be included in Andres's patch.
> > 
> > - Give the executor the ability to run from data-source (or driver)
> >   nodes up to the root.
> > 
> >   I'm not sure this is included, but I suppose he is aiming at this
> >   kind of thing.
> > 
> > - Rebuild asynchronous execution on the upside-down executor.
> 
> Anyway, I modified ExecProcNode into a push-up form and it *seems*
> to work to some extent. But triggers and cursors are almost broken
> and several other regression tests fail. Some nodes, such as
> windowagg, are terribly difficult to change to this push-up form
> (using a state machine). And of course it is terribly inefficient.
> 
> I'm afraid that all of this will turn out to be in vain. But anyway,
> and FWIW, I'll post the work here after some cleanup.

So, here it is. Maybe this is really a bad way to go. The worst part
is that it's terribly hard to maintain, because the behavior of the
state machine constructed in this patch is hardly predictable and thus
easily broken. During the cleanup I hit many crashes and infinite
loops, and they were a bit hard to diagnose. It will soon be broken by
subsequent commits.

Anyway, and again FWIW, here it is. I'll leave this alone for a while
(at least for the period of this CF) and reconsider async execution in
different forms.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hello.

A fully asynchronous executor requires every node to be stateful and
suspendable at the point where it requests the next tuple from the
nodes underneath it. I tried a pure push-based executor, but failed.

After the miserable patch upthread, I finally managed to make executor
nodes suspendable using computed jumps and got rid of the executor's
recursive calls. But it runs about 10x slower for the simple SeqScan
case (pgbench ran with 9% degradation), and that doesn't seem
recoverable by handy improvements. So I gave up on that.

Then I returned to single-level asynchrony; in other words, the simple
case with async-aware nodes sitting just above async-capable nodes.
The motive for the framework in the previous patch was that we had
degradation on the sync (or normal) code paths from polluting
ExecProcNode with async stuff; as Tom suggested, the node->ExecProcNode
trick can isolate the async code path.

The attached PoC patch theoretically has no impact on the normal code
paths and brings gains only in async cases. (The additional members in
PlanState did cause some degradation, though, seemingly coming from
alignment.)
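
Concretely, the isolation comes from choosing the dispatch function once
per node at initialization; abridged from ExecInitAppend in the attached
patch:

/* Abridged from ExecInitAppend in the attached patch: a plan with no
 * async-capable children keeps the unchanged synchronous function and
 * never executes any async code. */
appendstate->ps.ExecProcNode = ExecAppend;
if (appendstate->as_nasyncplans > 0)
    appendstate->ps.ExecProcNode = ExecAppendAsync;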

But I haven't gotten sufficiently stable results from the performance
tests; different builds of the same source code give apparently
different results... Anyway, I'll show the best result of several runs
here.
                          original(ms) patched(ms)    gain(%)
A: simple table scan     :  9714.70      9656.73         0.6
B: local partitioning    :  4119.44      4131.10        -0.3
C: single remote table   :  9484.86      9141.89         3.7
D: sharding (single con) :  7114.34      6751.21         5.1
E: sharding (multi con)  :  7166.56      1827.93        74.5

A and B are degradation checks, which are expected to show no
degradation. C is the gain from postgres_fdw's command presending
alone, on a single remote table. D is the gain from sharding over a
single connection; the number of partitions/shards is 4. E is the gain
from using a dedicated connection per shard.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From fc424c16e124934581a184fcadaed1e05f7672c8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource-owner field to WaitEventSet and
allows the creator of a WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c                     |  2 +-
 src/backend/storage/ipc/latch.c                | 18 ++++-
 src/backend/storage/lmgr/condition_variable.c  |  2 +-
 src/backend/utils/resowner/resowner.c          | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                    |  4 +-
 src/include/utils/resowner_private.h           |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)
 

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 754154b..d459f32 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
                 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
     AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
                       NULL, NULL);
     AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
 
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 4eb6e83..e6fc3dd 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
     int            nevents;        /* number of registered events */
     int            nevents_space;    /* maximum number of events in this set */
 
+    ResourceOwner    resowner;    /* Resource owner */
+
     /*
      * Array, of nevents_space length, storing the definition of events this
      * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
     int            ret = 0;
     int            rc;
     WaitEvent    event;
 
-    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
     if (wakeEvents & WL_TIMEOUT)
         Assert(timeout >= 0);
 
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
     WaitEventSet *set;
     char       *data;
     Size        sz = 0;
 
+    if (res)
+        ResourceOwnerEnlargeWESs(res);
+
     /*
      * Use MAXALIGN size/alignment to guarantee that later uses of memory are
      * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
 
@@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
     StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+    /* Register this wait event set if requested */
+    set->resowner = res;
+    if (res)
+        ResourceOwnerRememberWES(set->resowner, set);
+
     return set;
 }
 
@@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set)
     }
 #endif
 
+    if (set->resowner != NULL)
+        ResourceOwnerForgetWES(set->resowner, set);
+
     pfree(set);
 }
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index b4b7d28..182f759 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     /* Create a reusable WaitEventSet. */
     if (cv_wait_event_set == NULL)
     {
-        cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+        cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
         AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
                           MyLatch, NULL);
     }
 
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index bd19fad..d36481e 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
     ResourceArray snapshotarr;    /* snapshot references */
     ResourceArray filearr;        /* open temporary files */
     ResourceArray dsmarr;        /* dynamic shmem segments */
+    ResourceArray wesarr;        /* wait event sets */
 
     /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
     int            nlocks;            /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
     ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
     ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
     ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+    ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
     return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
                 PrintDSMLeakWarning(res);
             dsm_detach(res);
         }
+
+        /* Ditto for wait event sets */
+        while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+        {
+            WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+            if (isCommit)
+                PrintWESLeakWarning(event);
+            FreeWaitEventSet(event);
+        }
     }
     else if (phase == RESOURCE_RELEASE_LOCKS)
     {
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     Assert(owner->snapshotarr.nitems == 0);
     Assert(owner->filearr.nitems == 0);
     Assert(owner->dsmarr.nitems == 0);
+    Assert(owner->wesarr.nitems == 0);
     Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
     /*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     ResourceArrayFree(&(owner->snapshotarr));
     ResourceArrayFree(&(owner->filearr));
     ResourceArrayFree(&(owner->dsmarr));
+    ResourceArrayFree(&(owner->wesarr));
 
     pfree(owner);
 }
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
     elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
          dsm_segment_handle(seg));
 }
 
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+    ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+    ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identifier of a wait event set,
+     * use its pointer instead.
+     */
+    if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+        elog(ERROR, "wait event set %p is not owned by resource owner %s",
+             events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identifier of a wait event set,
+     * use its pointer instead.
+     */
+    elog(WARNING, "wait event set leak: %p still referenced",
+         events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index a43193c..997ee8d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+                                        ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
                   Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 2420b65..70b0bb9 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
                        dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+                         WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+                       WaitEventSet *);
+
 #endif                            /* RESOWNER_PRIVATE_H */
-- 
2.9.2
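
For reference, a caller of the changed API looks like the fragment below
(it mirrors what ExecAsyncEventWait in the following patch does). The point
of the new argument is that if anything errors out between creation and
FreeWaitEventSet, resource-owner cleanup now releases the set instead of
leaking it:

/* Usage sketch; assumes the usual backend context (latch.h included,
 * MyLatch set up), not a complete function. */
WaitEventSet *wes = CreateWaitEventSet(TopTransactionContext,
                                       TopTransactionResourceOwner, 1);
WaitEvent    event;

AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
/* An elog(ERROR) anywhere in here no longer leaks the set: the resource
 * owner frees it during abort. */
WaitEventSetWait(wes, -1 /* no timeout */, &event, 1, 0);
FreeWaitEventSet(wes);          /* also unregisters from the owner */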

From 1b213d238c398dc77cb31cf2a92284c70d292e9e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:23:51 +0900
Subject: [PATCH 2/3] core side modification

---
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/execAsync.c        | 110 ++++++++++++++++++
 src/backend/executor/nodeAppend.c       | 194 ++++++++++++++++++++++++++++++--
 src/backend/executor/nodeForeignscan.c  |  22 +++-
 src/backend/optimizer/plan/createplan.c |  56 ++++++++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/include/executor/execAsync.h        |  23 ++++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 ++
 src/include/nodes/execnodes.h           |  18 ++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 13 files changed, 428 insertions(+), 20 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 083b20f..21f5ad0 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..f7daed7
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,110 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *      Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *      src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+void ExecAsyncSetState(PlanState *pstate, AsyncState status)
+{
+    pstate->asyncstate = status;
+}
+
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                       void *data, bool reinit)
+{
+    switch (nodeTag(node))
+    {
+    case T_ForeignScanState:
+        return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+                                             wes, data, reinit);
+        break;
+    default:
+            elog(ERROR, "unrecognized node type: %d",
+                (int) nodeTag(node));
+    }
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+    static int *refind = NULL;
+    static int refindsize = 0;
+    WaitEventSet *wes;
+    WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+    int noccurred = 0;
+    Bitmapset *fired_events = NULL;
+    int i;
+    int n;
+
+    n = bms_num_members(waitnodes);
+    wes = CreateWaitEventSet(TopTransactionContext,
+                             TopTransactionResourceOwner, n);
+    if (refindsize < n)
+    {
+        if (refindsize == 0)
+            refindsize = EVENT_BUFFER_SIZE; /* XXX */
+        while (refindsize < n)
+            refindsize *= 2;
+        if (refind)
+            refind = (int *) repalloc(refind, refindsize * sizeof(int));
+        else
+            refind = (int *) palloc(refindsize * sizeof(int));
+    }
+
+    n = 0;
+    for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+         i = bms_next_member(waitnodes, i))
+    {
+        refind[i] = i;
+        if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+            n++;
+    }
+
+    if (n == 0)
+    {
+        FreeWaitEventSet(wes);
+        return NULL;
+    }
+
+    noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+                                 EVENT_BUFFER_SIZE,
+                                 WAIT_EVENT_ASYNC_WAIT);
+    FreeWaitEventSet(wes);
+    if (noccurred == 0)
+        return NULL;
+
+    for (i = 0 ; i < noccurred ; i++)
+    {
+        WaitEvent *w = &occurred_event[i];
+
+        if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+        {
+            int n = *(int*)w->user_data;
+
+            fired_events = bms_add_member(fired_events, n);
+        }
+    }
+
+    return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index bed9bb8..5355bb2 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -59,9 +59,11 @@
 #include "executor/execdebug.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -81,16 +83,16 @@ exec_append_initialize_next(AppendState *appendstate)
     /*
      * get information from the append node
      */
-    whichplan = appendstate->as_whichplan;
+    whichplan = appendstate->as_whichsyncplan;
 
-    if (whichplan < 0)
+    if (whichplan < appendstate->as_nasyncplans)
     {
         /*
          * if scanning in reverse, we start at the last scan in the list and
          * then proceed back to the first.. in any case we inform ExecAppend
          * that we are at the end of the line by returning FALSE
          */
-        appendstate->as_whichplan = 0;
+        appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
         return FALSE;
     }
     else if (whichplan >= appendstate->as_nplans)
@@ -98,7 +100,7 @@ exec_append_initialize_next(AppendState *appendstate)
     {
         /*
          * as above, end the scan if we go beyond the last scan in our list..
          */
-        appendstate->as_whichplan = appendstate->as_nplans - 1;
+        appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
         return FALSE;
     }
     else
+        appendstate->as_whichsyncplan = appendstate->as_nplans - 1;        return FALSE;    }    else
@@ -128,7 +130,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     ListCell   *lc;
 
     /* check for unsupported flags */
-    Assert(!(eflags & EXEC_FLAG_MARK));
+    Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
     /*
      * Lock the non-leaf tables in the partition tree controlled by this node.
 
@@ -151,6 +153,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     appendstate->ps.ExecProcNode = ExecAppend;
     appendstate->appendplans = appendplanstates;
     appendstate->as_nplans = nplans;
+    appendstate->as_nasyncplans = node->nasyncplans;
+    appendstate->as_syncdone = (node->nasyncplans == nplans);
+    appendstate->as_asyncresult = (TupleTableSlot **)
+        palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+    /* Choose async version of Exec function */
+    if (appendstate->as_nasyncplans > 0)
+        appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+    /* initially, all async requests need a request */
+    for (i = 0; i < appendstate->as_nasyncplans; ++i)
+        appendstate->as_needrequest =
+            bms_add_member(appendstate->as_needrequest, i);
 
     /*
      * Miscellaneous initialization
@@ -173,11 +188,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     foreach(lc, node->appendplans)
     {
         Plan       *initNode = (Plan *) lfirst(lc);
+        int            sub_eflags = eflags;
+
+        if (i < appendstate->as_nasyncplans)
+            sub_eflags |= EXEC_FLAG_ASYNC;
 
-        appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
+        appendplanstates[i] = ExecInitNode(initNode, estate, sub_eflags);
         i++;
     }
 
+    /* if there's any async-capable subnode, use async-aware routine */
+    if (appendstate->as_nasyncplans)
+        appendstate->ps.ExecProcNode = ExecAppendAsync;
+
     /*
      * initialize output tuple type
      */
@@ -187,7 +210,10 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     /*
      * initialize to scan first subplan
      */
-    appendstate->as_whichplan = 0;
+    /*
+     * initialize to scan first synchronous subplan
+     */
+    appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
     exec_append_initialize_next(appendstate);
 
     return appendstate;
 
@@ -204,6 +230,8 @@ ExecAppend(PlanState *pstate)
 {
     AppendState *node = castNode(AppendState, pstate);
 
+    Assert(node->as_nasyncplans == 0);
+
     for (;;)
     {
         PlanState  *subnode;
@@ -214,7 +242,7 @@ ExecAppend(PlanState *pstate)
         /*
          * figure out which subplan we are currently processing
          */
-        subnode = node->appendplans[node->as_whichplan];
+        subnode = node->appendplans[node->as_whichsyncplan];
 
         /*
          * get a tuple from the subplan
@@ -237,9 +265,9 @@ ExecAppend(PlanState *pstate)
          * ExecInitAppend.
          */
         if (ScanDirectionIsForward(node->ps.state->es_direction))
-            node->as_whichplan++;
+            node->as_whichsyncplan++;
         else
-            node->as_whichplan--;
+            node->as_whichsyncplan--;
         if (!exec_append_initialize_next(node))
             return ExecClearTuple(node->ps.ps_ResultTupleSlot);
@@ -247,6 +275,141 @@ ExecAppend(PlanState *pstate)
     }
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+    AppendState *node = castNode(AppendState, pstate);
+    Bitmapset *needrequest;
+    int    i;
+
+    Assert(node->as_nasyncplans > 0);
+
+    if (node->as_nasyncresult > 0)
+    {
+        --node->as_nasyncresult;
+        return node->as_asyncresult[node->as_nasyncresult];
+    }
+
+    needrequest = node->as_needrequest;
+    node->as_needrequest = NULL;
+    while ((i = bms_first_member(needrequest)) >= 0)
+    {
+        TupleTableSlot *slot;
+        PlanState *subnode = node->appendplans[i];
+
+        slot = ExecProcNode(subnode);
+        if (subnode->asyncstate == AS_AVAILABLE)
+        {
+            if (!TupIsNull(slot))
+            {
+                node->as_asyncresult[node->as_nasyncresult++] = slot;
+                node->as_needrequest = bms_add_member(node->as_needrequest, i);
+            }
+        }
+        else
+            node->as_pending_async = bms_add_member(node->as_pending_async, i);
+    }
+    bms_free(needrequest);
+
+    for (;;)
+    {
+        TupleTableSlot *result;
+
+        /* return now if a result is available */
+        if (node->as_nasyncresult > 0)
+        {
+            --node->as_nasyncresult;
+            return node->as_asyncresult[node->as_nasyncresult];
+        }
+
+        while (!bms_is_empty(node->as_pending_async))
+        {
+            long timeout = node->as_syncdone ? -1 : 0;
+            Bitmapset *fired;
+            int i;
+
+            fired = ExecAsyncEventWait(node->appendplans, node->as_pending_async,
+                                       timeout);
+            while ((i = bms_first_member(fired)) >= 0)
+            {
+                TupleTableSlot *slot;
+                PlanState *subnode = node->appendplans[i];
+                slot = ExecProcNode(subnode);
+                if (subnode->asyncstate == AS_AVAILABLE)
+                {
+                    if (!TupIsNull(slot))
+                    {
+                        node->as_asyncresult[node->as_nasyncresult++] = slot;
+                        node->as_needrequest =
+                            bms_add_member(node->as_needrequest, i);
+                    }
+                    node->as_pending_async =
+                        bms_del_member(node->as_pending_async, i);
+                }
+            }
+            bms_free(fired);
+
+            /* return now if a result is available */
+            if (node->as_nasyncresult > 0)
+            {
+                --node->as_nasyncresult;
+                return node->as_asyncresult[node->as_nasyncresult];
+            }
+
+            if (!node->as_syncdone)
+                break;
+        }
+
+        /*
+         * If there is no asynchronous activity still pending and the
+         * synchronous activity is also complete, we're totally done scanning
+         * this node.  Otherwise, we're done with the asynchronous stuff but
+         * must continue scanning the synchronous children.
+         */
+        if (node->as_syncdone)
+        {
+            Assert(bms_is_empty(node->as_pending_async));
+            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+        }
+
+        /*
+         * get a tuple from the subplan
+         */
+        result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+        if (!TupIsNull(result))
+        {
+            /*
+             * If the subplan gave us something then return it as-is. We do
+             * NOT make use of the result slot that was set up in
+             * ExecInitAppend; there's no need for it.
+             */
+            return result;
+        }
+
+        /*
+         * Go on to the "next" subplan in the appropriate direction. If no
+         * more subplans, return the empty slot set up for us by
+         * ExecInitAppend, unless there are async plans we have yet to finish.
+         */
+        if (ScanDirectionIsForward(node->ps.state->es_direction))
+            node->as_whichsyncplan++;
+        else
+            node->as_whichsyncplan--;
+        if (!exec_append_initialize_next(node))
+        {
+            node->as_syncdone = true;
+            if (bms_is_empty(node->as_pending_async))
+            {
+                Assert(bms_is_empty(node->as_needrequest));
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+            }
+        }
+
+        /* Else loop back and try to get a tuple from the new subplan */
+    }
+}
+
 /* ----------------------------------------------------------------
  *        ExecEndAppend
  *
@@ -280,6 +443,15 @@ ExecReScanAppend(AppendState *node)
 {
     int            i;
 
+    /* Reset async state. */
+    for (i = 0; i < node->as_nasyncplans; ++i)
+    {
+        ExecShutdownNode(node->appendplans[i]);
+        node->as_needrequest = bms_add_member(node->as_needrequest, i);
+    }
+    node->as_nasyncresult = 0;
+    node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
     for (i = 0; i < node->as_nplans; i++)
     {
         PlanState  *subnode = node->appendplans[i];
@@ -298,6 +470,6 @@ ExecReScanAppend(AppendState *node)
         if (subnode->chgParam == NULL)
             ExecReScan(subnode);
     }
 
-    node->as_whichplan = 0;
+    node->as_whichsyncplan = node->as_nasyncplans;
     exec_append_initialize_next(node);
 }
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 20892d6..e851988 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate)
                     (ExecScanRecheckMtd) ForeignRecheck);
 }
 
-
 /* ----------------------------------------------------------------
  *        ExecInitForeignScan
  * ----------------------------------------------------------------
@@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
     scanstate->ss.ps.plan = (Plan *) node;
     scanstate->ss.ps.state = estate;
     scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+    scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+    if ((eflags & EXEC_FLAG_ASYNC) != 0)
+        scanstate->fs_async = true;
 
     /*
      * Miscellaneous initialization
@@ -388,3 +391,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
     if (fdwroutine->ShutdownForeignScan)
         fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *        ExecAsyncForeignScanConfigureWait
+ *
+ *        In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                              void *caller_data, bool reinit)
+{
+    FdwRoutine *fdwroutine = node->fdwroutine;
+
+    Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+    return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+                                                 caller_data, reinit);
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 792ea84..53eb56d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -203,7 +203,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual,
                        Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
                    Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist, List *partitioned_rels);
+static Append *make_append(List *appendplans, int nasyncplans, int referent,
+                           List *tlist, List *partitioned_rels);
 static RecursiveUnion *make_recursive_union(List *tlist,
                      Plan *lefttree,
                      Plan *righttree,
@@ -283,6 +284,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
                  List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
                          GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -1004,8 +1006,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
     Append       *plan;
     List       *tlist = build_path_tlist(root, &best_path->path);
-    List       *subplans = NIL;
+    List       *asyncplans = NIL;
+    List       *syncplans = NIL;
     ListCell   *subpaths;
+    int            nasyncplans = 0;
+    bool        first = true;
+    bool        referent_is_sync = true;
 
     /*
      * The subpaths list could be empty, if every child was proven empty by
@@ -1040,7 +1046,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
         /* Must insist that all children return the same tlist */
         subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-        subplans = lappend(subplans, subplan);
+        /* Classify as async-capable or not */
+        if (is_async_capable_path(subpath))
+        {
+            asyncplans = lappend(asyncplans, subplan);
+            ++nasyncplans;
+            if (first)
+                referent_is_sync = false;
+        }
+        else
+            syncplans = lappend(syncplans, subplan);
+
+        first = false;
     }
 
     /*
@@ -1050,7 +1067,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
      * parent-rel Vars it'll be asked to emit.
      */
 
-    plan = make_append(subplans, tlist, best_path->partitioned_rels);
+    plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+                       referent_is_sync ? nasyncplans : 0, tlist,
+                       best_path->partitioned_rels);
 
     copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5281,7 +5300,8 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int nasyncplans, int referent,
+            List *tlist, List *partitioned_rels)
 {
     Append       *node = makeNode(Append);
     Plan       *plan = &node->plan;
@@ -5292,6 +5312,8 @@ make_append(List *appendplans, List *tlist, List *partitioned_rels)
     plan->righttree = NULL;
     node->partitioned_rels = partitioned_rels;
     node->appendplans = appendplans;
+    node->nasyncplans = nasyncplans;
+    node->referent = referent;
 
     return node;
 }
@@ -6628,3 +6650,27 @@ is_projection_capable_plan(Plan *plan)
     }
     return true;
 }
+
+/*
+ * is_async_capable_path
+ *        Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+    switch (nodeTag(path))
+    {
+        case T_ForeignPath:
+            {
+                FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+                Assert(fdwroutine != NULL);
+                if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+                    fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+                    return true;
+            }
+        default:
+            break;
+    }
+    return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3a0b49c..4c6571e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3628,6 +3628,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_SYNC_REP:
             event_name = "SyncRep";
             break;
+        case WAIT_EVENT_ASYNC_WAIT:
+            event_name = "AsyncExecWait";
+            break;
 
             /* no default case, so that compiler will warn */
     }
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..5fd67d9
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *        Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *        src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern void ExecAsyncSetState(PlanState *pstate, AsyncState status);
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                                   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+                                     long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 37fd6b2..2ab9d72 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -63,6 +63,7 @@
 #define EXEC_FLAG_WITH_OIDS        0x0020    /* force OIDs in returned tuples */
 #define EXEC_FLAG_WITHOUT_OIDS    0x0040    /* force no OIDs in returned tuples */
 #define EXEC_FLAG_WITH_NO_DATA    0x0080    /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC            0x0100    /* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
 
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0354c2c..fed46d7 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
                                 shm_toc *toc);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+                                          WaitEventSet *wes,
+                                          void *caller_data, bool reinit);
 
 #endif                            /* NODEFOREIGNSCAN_H */
 
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 04e43cc..566236b 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -161,6 +161,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
                                                             List *fdw_private,
                                                             RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+                                                    WaitEventSet *wes,
+                                                    void *caller_data,
+                                                    bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
 
@@ -182,6 +187,7 @@ typedef struct FdwRoutine
     GetForeignPlan_function GetForeignPlan;
     BeginForeignScan_function BeginForeignScan;
     IterateForeignScan_function IterateForeignScan;
+    IterateForeignScan_function IterateForeignScanAsync;
     ReScanForeignScan_function ReScanForeignScan;
     EndForeignScan_function EndForeignScan;
 
@@ -232,6 +238,11 @@ typedef struct FdwRoutine
     InitializeDSMForeignScan_function InitializeDSMForeignScan;
     ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
     InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+    /* Support functions for asynchronous execution */
+    IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+    ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
     ShutdownForeignScan_function ShutdownForeignScan;
 
     /* Support functions for path reparameterization. */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index c461134..7f663eb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -840,6 +840,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+    AS_AVAILABLE,
+    AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
     NodeTag        type;
@@ -880,6 +886,9 @@ typedef struct PlanState
     TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
     ExprContext *ps_ExprContext;    /* node's expression-evaluation context */
     ProjectionInfo *ps_ProjInfo;    /* info for doing tuple projection */
+
+    AsyncState    asyncstate;
+    int32        padding;            /* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1003,7 +1012,13 @@ typedef struct AppendState
     PlanState    ps;                /* its first field is NodeTag */
     PlanState **appendplans;    /* array of PlanStates for my inputs */
     int            as_nplans;
-    int            as_whichplan;
+    int            as_nasyncplans;    /* # of async-capable children */
+    int            as_whichsyncplan; /* which sync plan is being executed  */
+    bool        as_syncdone;    /* all synchronous plans done? */
+    Bitmapset  *as_needrequest;    /* async plans needing a new request */
+    Bitmapset  *as_pending_async;    /* pending async plans */
+    TupleTableSlot **as_asyncresult;    /* unreturned results of async plans */
+    int            as_nasyncresult;    /* # of valid entries in as_asyncresult */
 } AppendState;
 
 /* ----------------
@@ -1546,6 +1561,7 @@ typedef struct ForeignScanState
     Size        pscan_len;        /* size of parallel coordination information */
     /* use struct pointer to avoid including fdwapi.h here */
     struct FdwRoutine *fdwroutine;
+    bool        fs_async;
     void       *fdw_state;        /* foreign-data wrapper can keep state here */
 } ForeignScanState;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index a382331..e0eccc8 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -248,6 +248,8 @@ typedef struct Append
     /* RT indexes of non-leaf tables in a partition tree */
     List       *partitioned_rels;
     List       *appendplans;
+    int            nasyncplans;    /* # of async plans, always at start of list */
+    int            referent;         /* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..fe9d39c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -816,7 +816,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_ORIGIN_DROP,
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
+    WAIT_EVENT_SYNC_REP,
+    WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2
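
Before the postgres_fdw patch itself, this is the skeleton an FDW would
have to supply under the core patch above. Only the FdwRoutine fields and
callback signatures come from the fdwapi.h changes in the previous patch;
the my_* functions, including my_connection_socket(), are hypothetical:

/* Sketch only; assumes the usual FDW headers (foreign/fdwapi.h etc.). */
static bool
my_IsForeignPathAsyncCapable(ForeignPath *path)
{
    return true;                /* e.g. every plain remote scan */
}

static bool
my_ForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
                             void *caller_data, bool reinit)
{
    /* register the remote connection's socket; true = event added */
    AddWaitEventToSet(wes, WL_SOCKET_READABLE,
                      my_connection_socket(node),   /* hypothetical */
                      NULL, caller_data);
    return true;
}

Datum
my_fdw_handler(PG_FUNCTION_ARGS)
{
    FdwRoutine *routine = makeNode(FdwRoutine);

    /* ... the usual scan/plan callbacks go here ... */
    routine->IsForeignPathAsyncCapable = my_IsForeignPathAsyncCapable;
    routine->ForeignAsyncConfigureWait = my_ForeignAsyncConfigureWait;

    PG_RETURN_POINTER(routine);
}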

From 9f6a16ef7f7d1a38353216191641deb0d3ea58e7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c              |  26 ++
 contrib/postgres_fdw/expected/postgres_fdw.out | 128 ++++---
 contrib/postgres_fdw/postgres_fdw.c            | 484 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 522 insertions(+), 138 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index be4ec07..00301d0 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
     bool        invalidated;    /* true if reconnect is pending */
     uint32        server_hashvalue;    /* hash value of foreign server OID */
     uint32        mapping_hashvalue;    /* hash value of user mapping OID */
+    void        *storage;        /* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
         elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
              entry->conn, server->servername, user->umid, user->userid);
+        entry->storage = NULL;
     }
 
     /*
@@ -216,6 +218,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user.  Allocates it with
+ * initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+    bool        found;
+    ConnCacheEntry *entry;
+    ConnCacheKey key;
+
+    key = user->umid;
+    entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+    Assert(found);
+
+    if (entry->storage == NULL)
+    {
+        entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+        memset(entry->storage, 0, initsize);
+    }
+
+    return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 4339bbf..2a0a662 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6512,7 +6512,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -6540,7 +6540,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6568,7 +6568,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6596,7 +6596,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -6662,35 +6662,40 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6700,35 +6705,40 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
               ->  HashAggregate
                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                     Group Key: foo.f1
                     ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -6758,11 +6768,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -6776,11 +6786,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (39 rows)

 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6811,16 +6821,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -6838,16 +6848,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)

 update bar set f2 = f2 + 100
@@ -6998,27 +7008,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)

 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fb65e2e..0688504 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;

 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2

+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
 };

+/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+    ForeignScanState *current_owner;    /* The node currently running a query
+                                         * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+    PGconn       *conn;            /* connection for the scan */
+    PgFdwConnpriv *connpriv;    /* connection private memory */
+} PgFdwState;
+
 /*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table. NULL
                                  * for a foreign join scan. */
     TupleDesc    tupdesc;        /* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
     List       *retrieved_attrs;    /* list of retrieved attribute numbers */

     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
+    bool        result_ready;
     unsigned int cursor_number; /* quasi-unique ID for my cursor */
     bool        cursor_exists;    /* have we created the cursor? */
     int            numParams;        /* number of parameters passed to query */
 
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
     /* batch-level state, for optimizing rewinds and avoiding useless fetch */
     int            fetch_ct_2;        /* Min(# of fetches done, 2) */
     bool        eof_reached;    /* true if last fetch reached EOF */
+    bool        run_async;        /* true if run asynchronously */
+    bool        async_waiting;    /* true if requesting the parent to wait */
+    ForeignScanState *waiter;    /* Next node to run a query among nodes
+                                 * sharing the same connection */
+    ForeignScanState *last_waiter;    /* A waiting node at the end of a waiting
+                                     * list. Maintained only by the current
+                                     * owner of the connection */

     /* working memory contexts */
     MemoryContext batch_cxt;    /* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */

     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     char       *p_name;            /* name of prepared statement, if created */

     /* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
@@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
                                RangeTblEntry *target_rte,
                                Relation target_relation);
@@ -348,6 +380,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
                              UpperRelationKind stage,
                              RelOptInfo *input_rel,
                              RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+                                              WaitEventSet *wes,
+                                              void *caller_data, bool reinit);

 /*
  * Helper functions
@@ -368,7 +404,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
                           EquivalenceClass *ec, EquivalenceMember *em,
                           void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
 
@@ -438,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     routine->IterateForeignScan = postgresIterateForeignScan;
     routine->ReScanForeignScan = postgresReScanForeignScan;
     routine->EndForeignScan = postgresEndForeignScan;
+    routine->ShutdownForeignScan = postgresShutdownForeignScan;

     /* Functions for updating foreign tables */
     routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -472,6 +512,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     /* Support functions for upper relation push-down */
     routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;

+    /* Support functions for async execution */
+    routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+    routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
     PG_RETURN_POINTER(routine);
 }
@@ -1322,12 +1366,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    fsstate->conn = GetConnection(user, false);
+    fsstate->s.conn = GetConnection(user, false);
+    fsstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+    fsstate->s.connpriv->current_owner = NULL;
+    fsstate->waiter = NULL;
+    fsstate->last_waiter = node;

     /* Assign a unique ID for my cursor */
-    fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+    fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
     fsstate->cursor_exists = false;

+    /* Initialize async execution status */
+    fsstate->run_async = false;
+    fsstate->async_waiting = false;
+
     /* Get private info created by planner functions. */
     fsstate->query = strVal(list_nth(fsplan->fdw_private,
                                      FdwScanPrivateSelectSql));
@@ -1383,32 +1436,136 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;

     /*
-     * If this is the first call after Begin or ReScan, we need to create the
-     * cursor on the remote side.
-     */
-    if (!fsstate->cursor_exists)
-        create_cursor(node);
-
-    /*
      * Get some more tuples, if we've run out.
      */
     if (fsstate->next_tuple >= fsstate->num_tuples)
     {
-        /* No point in another fetch if we already detected EOF, though. */
-        if (!fsstate->eof_reached)
-            fetch_more_data(node);
-        /* If we didn't get any tuples, must be end of data. */
+        ForeignScanState *next_conn_owner = node;
+
+        /* This node has sent a query on this connection */
+        if (fsstate->s.connpriv->current_owner == node)
+        {
+            /* Check if the result is available */
+            if (PQisBusy(fsstate->s.conn))
+            {
+                int rc = WaitLatchOrSocket(NULL,
+                                           WL_SOCKET_READABLE | WL_TIMEOUT,
+                                           PQsocket(fsstate->s.conn), 0,
+                                           WAIT_EVENT_ASYNC_WAIT);
+                if (node->fs_async && !(rc & WL_SOCKET_READABLE))
+                {
+                    /*
+                     * This node is not ready yet. Tell the caller to wait.
+                     */
+                    fsstate->result_ready = false;
+                    node->ss.ps.asyncstate = AS_WAITING;
+                    return ExecClearTuple(slot);
+                }
+            }
+
+            Assert(fsstate->async_waiting);
+            fsstate->async_waiting = false;
+            fetch_received_data(node);
+
+            /*
+             * If any other node is waiting on this connection, let the
+             * first waiter be the next owner of the connection.
+             */
+            if (fsstate->waiter)
+            {
+                PgFdwScanState *next_owner_state;
+
+                next_conn_owner = fsstate->waiter;
+                next_owner_state = GetPgFdwScanState(next_conn_owner);
+                fsstate->waiter = NULL;
+
+                /*
+                 * Only the current owner is responsible for maintaining the
+                 * shortcut to the last waiter.
+                 */
+                next_owner_state->last_waiter = fsstate->last_waiter;
+
+                /*
+                 * For simplicity, last_waiter points to the node itself when
+                 * no one is waiting for it.
+                 */
+                fsstate->last_waiter = node;
+            }
+        }
+        else if (fsstate->s.connpriv->current_owner &&
+                 !GetPgFdwScanState(node)->eof_reached)
+        {
+            /*
+             * Someone else is holding this connection, so this node has to
+             * run later. Add myself to the tail of the waiters' list, then
+             * return not-ready.  To avoid scanning through the waiters' list,
+             * the current owner is to maintain the shortcut to the last
+             * waiter.
+             */
+            PgFdwScanState *conn_owner_state =
+                GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+            ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+            PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+            last_waiter_state->waiter = node;
+            conn_owner_state->last_waiter = node;
+
+            /* Register the node to the async-waiting node list */
+            Assert(!GetPgFdwScanState(node)->async_waiting);
+
+            GetPgFdwScanState(node)->async_waiting = true;
+
+            fsstate->result_ready = fsstate->eof_reached;
+            node->ss.ps.asyncstate =
+                fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
+            return ExecClearTuple(slot);
+        }
+
+        /* At this time no node is running on the connection */
+        Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+               == NULL);
+        /*
+         * Send the next request for the next owner of this connection if
+         * needed.
+         */
+        if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+        {
+            PgFdwScanState *next_owner_state =
+                GetPgFdwScanState(next_conn_owner);
+
+            request_more_data(next_conn_owner);
+
+            /* Register the node to the async-waiting node list */
+            if (!next_owner_state->async_waiting)
+                next_owner_state->async_waiting = true;
+
+            if (!next_conn_owner->fs_async)
+                fetch_received_data(next_conn_owner);
+        }
+
+
+        /*
+         * If we haven't received a result for the given node this time,
+         * return with no tuple to give way to other nodes.
+         */
         if (fsstate->next_tuple >= fsstate->num_tuples)
+        {
+            fsstate->result_ready = fsstate->eof_reached;
+            node->ss.ps.asyncstate =
+                fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
             return ExecClearTuple(slot);
+        }
     }

     /*
      * Return the next tuple.
      */
+    fsstate->result_ready = true;
+    node->ss.ps.asyncstate = AS_AVAILABLE;
     ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
                    slot,
                    InvalidBuffer,
@@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     char        sql[64];
     PGresult   *res;
@@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node)
     if (!fsstate->cursor_exists)
         return;

+    /* Absorb the remaining result */
+    absorb_current_result(node);
+
     /*
      * If any internal parameters affecting this node have changed, we'd
      * better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_exec_query(fsstate->conn, sql);
+    res = pgfdw_exec_query(fsstate->s.conn, sql);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+        pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
     PQclear(res);

     /* Now force a fresh FETCH. */
@@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);

     /* if fsstate is NULL, we are in EXPLAIN; nothing to do */
     if (fsstate == NULL)
@@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node)
     /* Close the cursor if open, to prevent accumulation of cursors */
     if (fsstate->cursor_exists)
-        close_cursor(fsstate->conn, fsstate->cursor_number);
+        close_cursor(fsstate->s.conn, fsstate->cursor_number);

     /* Release remote connection */
-    ReleaseConnection(fsstate->conn);
-    fsstate->conn = NULL;
+    ReleaseConnection(fsstate->s.conn);
+    fsstate->s.conn = NULL;

     /* MemoryContexts will be deleted automatically. */
 }

 /*
+ * postgresShutdownForeignScan
+ *        Remove asynchrony stuff and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+    ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+    if (plan->operation != CMD_SELECT)
+        return;
+
+    /* Absorb the remaining result */
+    absorb_current_result(node);
+}
+
 /*
  * postgresAddForeignUpdateTargets
  *        Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1700,7 +1876,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
     user = GetUserMapping(userid, table->serverid);

     /* Open connection; report that we'll create a prepared statement. */
-    fmstate->conn = GetConnection(user, true);
+    fmstate->s.conn = GetConnection(user, true);
+    fmstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
     fmstate->p_name = NULL;        /* prepared statement not made yet */

     /* Deconstruct fdw_private data. */
@@ -1779,6 +1957,8 @@ postgresExecForeignInsert(EState *estate,
     PGresult   *res;
     int            n_rows;

+    vacate_connection((PgFdwState *) fmstate);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1789,14 +1969,14 @@ postgresExecForeignInsert(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);

     /*
      * Get the result, and check for success.
@@ -1804,10 +1984,10 @@ postgresExecForeignInsert(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);

     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -1845,6 +2025,8 @@ postgresExecForeignUpdate(EState *estate,
     PGresult   *res;
     int            n_rows;

+    vacate_connection((PgFdwState *) fmstate);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1865,14 +2047,14 @@ postgresExecForeignUpdate(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);

     /*
      * Get the result, and check for success.
@@ -1880,10 +2062,10 @@ postgresExecForeignUpdate(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);

     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -1921,6 +2103,8 @@ postgresExecForeignDelete(EState *estate,
     PGresult   *res;
     int            n_rows;

+    vacate_connection((PgFdwState *) fmstate);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1941,14 +2125,14 @@ postgresExecForeignDelete(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);

     /*
      * Get the result, and check for success.
@@ -1956,10 +2140,10 @@ postgresExecForeignDelete(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);

     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -2006,16 +2190,16 @@ postgresEndForeignModify(EState *estate,
         * We don't use a PG_TRY block here, so be careful not to throw error
         * without releasing the PGresult.
         */
-        res = pgfdw_exec_query(fmstate->conn, sql);
+        res = pgfdw_exec_query(fmstate->s.conn, sql);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
-            pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+            pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
        PQclear(res);
        fmstate->p_name = NULL;
    }

    /* Release remote connection */
-    ReleaseConnection(fmstate->conn);
-    fmstate->conn = NULL;
+    ReleaseConnection(fmstate->s.conn);
+    fmstate->s.conn = NULL;
 }

 /*
@@ -2303,7 +2487,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    dmstate->conn = GetConnection(user, false);
+    dmstate->s.conn = GetConnection(user, false);
+    dmstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));

     /* Initialize state variable */
     dmstate->num_tuples = -1;    /* -1 means not set yet */
@@ -2356,7 +2542,10 @@ postgresIterateDirectModify(ForeignScanState *node)
      * If this is the first call after Begin, execute the statement.
      */
     if (dmstate->num_tuples == -1)
+    {
+        vacate_connection((PgFdwState *) dmstate);
         execute_dml_stmt(node);
+    }

     /*
      * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2403,8 +2592,8 @@ postgresEndDirectModify(ForeignScanState *node)
         PQclear(dmstate->result);

     /* Release remote connection */
-    ReleaseConnection(dmstate->conn);
-    dmstate->conn = NULL;
+    ReleaseConnection(dmstate->s.conn);
+    dmstate->s.conn = NULL;

     /* MemoryContext will be deleted automatically. */
 }
@@ -2523,6 +2712,7 @@ estimate_path_cost_size(PlannerInfo *root,
         List       *local_param_join_conds;
         StringInfoData sql;
         PGconn       *conn;
+        PgFdwConnpriv *connpriv;
         Selectivity local_sel;
         QualCost    local_cost;
         List       *fdw_scan_tlist = NIL;
@@ -2565,6 +2755,16 @@ estimate_path_cost_size(PlannerInfo *root,
         /* Get the remote estimate */
         conn = GetConnection(fpinfo->user, false);
+        connpriv = GetConnectionSpecificStorage(fpinfo->user,
+                                                sizeof(PgFdwConnpriv));
+        if (connpriv)
+        {
+            PgFdwState tmpstate;
+            tmpstate.conn = conn;
+            tmpstate.connpriv = connpriv;
+            vacate_connection(&tmpstate);
+        }
+
         get_remote_estimate(sql.data, conn, &rows, &width,
                             &startup_cost, &total_cost);
         ReleaseConnection(conn);
@@ -2919,11 +3119,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     ExprContext *econtext = node->ss.ps.ps_ExprContext;
     int            numParams = fsstate->numParams;
     const char **values = fsstate->param_values;
-    PGconn       *conn = fsstate->conn;
+    PGconn       *conn = fsstate->s.conn;
     StringInfoData buf;
     PGresult   *res;
@@ -2989,47 +3189,96 @@ create_cursor(ForeignScanState *node)
 /*
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    PGconn       *conn = fsstate->s.conn;
+    char        sql[64];
+
+    /* The connection should be vacant */
+    Assert(fsstate->s.connpriv->current_owner == NULL);
+
+    /*
+     * If this is the first call after Begin or ReScan, we need to create the
+     * cursor on the remote side.
+     */
+    if (!fsstate->cursor_exists)
+        create_cursor(node);
+
+    snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+             fsstate->fetch_size, fsstate->cursor_number);
+
+    if (!PQsendQuery(conn, sql))
+        pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+    fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Read the rows returned for a fetch previously sent by request_more_data.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     PGresult   *volatile res = NULL;
     MemoryContext oldcontext;

+    /* I should be the current connection owner */
+    Assert(fsstate->s.connpriv->current_owner == node);
+
     /*
      * We'll store the tuples in the batch_cxt.  First, flush the previous
-     * batch.
+     * batch if no tuple is remaining.
      */
-    fsstate->tuples = NULL;
-    MemoryContextReset(fsstate->batch_cxt);
+    if (fsstate->next_tuple >= fsstate->num_tuples)
+    {
+        fsstate->tuples = NULL;
+        fsstate->num_tuples = 0;
+        MemoryContextReset(fsstate->batch_cxt);
+    }
+    else if (fsstate->next_tuple > 0)
+    {
+        /* move the remaining tuples to the beginning of the store */
+        int n = 0;
+
+        while(fsstate->next_tuple < fsstate->num_tuples)
+            fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+        fsstate->num_tuples = n;
+    }
+
     oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);

     /* PGresult must be released before leaving this function. */
     PG_TRY();
     {
-        PGconn       *conn = fsstate->conn;
+        PGconn       *conn = fsstate->s.conn;
         char        sql[64];
-        int            numrows;
+        int            addrows;
+        size_t        newsize;
         int            i;

         snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
                  fsstate->fetch_size, fsstate->cursor_number);
 
-        res = pgfdw_exec_query(conn, sql);
+        res = pgfdw_get_result(conn, sql);
         /* On error, report the original query, not the FETCH. */
         if (PQresultStatus(res) != PGRES_TUPLES_OK)
             pgfdw_report_error(ERROR, res, conn, false, fsstate->query);

         /* Convert the data into HeapTuples */
-        numrows = PQntuples(res);
-        fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-        fsstate->num_tuples = numrows;
-        fsstate->next_tuple = 0;
+        addrows = PQntuples(res);
+        newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+        if (fsstate->tuples)
+            fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+        else
+            fsstate->tuples = (HeapTuple *) palloc(newsize);
-        for (i = 0; i < numrows; i++)
+        for (i = 0; i < addrows; i++)
         {
             Assert(IsA(node->ss.ps.plan, ForeignScan));

-            fsstate->tuples[i] =
+            fsstate->tuples[fsstate->num_tuples + i] =
                 make_tuple_from_result_row(res, i,
                                            fsstate->rel,
                                            fsstate->attinmeta,
@@ -3039,27 +3288,82 @@ fetch_more_data(ForeignScanState *node)
         }

         /* Update fetch_ct_2 */
-        if (fsstate->fetch_ct_2 < 2)
+        if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
             fsstate->fetch_ct_2++;

+        fsstate->next_tuple = 0;
+        fsstate->num_tuples += addrows;
+
         /* Must be EOF if we didn't get as many tuples as we asked for. */
-        fsstate->eof_reached = (numrows < fsstate->fetch_size);
+        fsstate->eof_reached = (addrows < fsstate->fetch_size);

         PQclear(res);
         res = NULL;
     }
     PG_CATCH();
     {
+        fsstate->s.connpriv->current_owner = NULL;
         if (res)
             PQclear(res);
         PG_RE_THROW();
     }
     PG_END_TRY();

+    fsstate->s.connpriv->current_owner = NULL;
+
     MemoryContextSwitchTo(oldcontext);
 }

 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+    PgFdwConnpriv *connpriv = fdwstate->connpriv;
+    ForeignScanState *owner;
+
+    if (connpriv == NULL || connpriv->current_owner == NULL)
+        return;
+
+    /*
+     * Let the current connection owner read the result of the running query.
+     */
+    owner = connpriv->current_owner;
+    fetch_received_data(owner);
+
+    /* Clear the waiting list */
+    while (owner)
+    {
+        PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+        fsstate->last_waiter = NULL;
+        owner = fsstate->waiter;
+        fsstate->waiter = NULL;
+    }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+    if (owner)
+    {
+        PgFdwScanState *target_state = GetPgFdwScanState(owner);
+        PGconn *conn = target_state->s.conn;
+
+        while(PQisBusy(conn))
+            PQclear(PQgetResult(conn));
+        fsstate->s.connpriv->current_owner = NULL;
+        fsstate->async_waiting = false;
+    }
+}
+
 /*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
 
@@ -3143,7 +3447,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
     /* Construct name we'll use for the prepared statement. */
     snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-             GetPrepStmtNumber(fmstate->conn));
+             GetPrepStmtNumber(fmstate->s.conn));
     p_name = pstrdup(prep_name);

     /*
@@ -3153,12 +3457,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * the prepared statements we use in this module are simple enough that
      * the remote server will make the right choices.
      */
-    if (!PQsendPrepare(fmstate->conn,
+    if (!PQsendPrepare(fmstate->s.conn,
                        p_name,
                        fmstate->query,
                        0,
                        NULL))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);

     /*
      * Get the result, and check for success.
 
@@ -3166,9 +3470,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
     PQclear(res);

     /* This action shows that the prepare has been done. */
 
@@ -3299,9 +3603,9 @@ execute_dml_stmt(ForeignScanState *node)
      * the desired result.  This allows us to avoid assuming that the remote
      * server has the same OIDs we do for the parameters' types.
      */
-    if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+    if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
                            NULL, values, NULL, NULL, 0))
-        pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+        pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);

     /*
      * Get the result, and check for success.
 
@@ -3309,10 +3613,10 @@ execute_dml_stmt(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+    dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
     if (PQresultStatus(dmstate->result) !=
         (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+        pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
                            dmstate->query);

     /* Get the number of rows affected. */
@@ -4582,6 +4886,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
     /* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+    return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when the node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                                  void *caller_data, bool reinit)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+    /* If the caller didn't reinit, this event is already in event set */
+    if (!reinit)
+        return true;
+
+    if (fsstate->s.connpriv->current_owner == node)
+    {
+        AddWaitEventToSet(wes,
+                          WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+                          NULL, caller_data);
+        return true;
+    }
+
+    return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
 
@@ -4946,7 +5286,7 @@ make_tuple_from_result_row(PGresult *res,
         PgFdwScanState *fdw_sstate;

         Assert(fsstate);
-        fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+        fdw_sstate = GetPgFdwScanState(fsstate);
         tupdesc = fdw_sstate->tupdesc;
     }
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 788b003..41ac1d2 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
     UserMapping *user;            /* only set in use_remote_estimate mode */
     int            fetch_size;        /* fetch size for this remote table */
+    bool        allow_prefetch;    /* true to allow overlapped fetching */

     /*
      * Name of the relation while EXPLAINing ForeignScan. It is used for join
 
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);

 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
 
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index ddfec79..56aae91 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1535,25 +1535,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');

-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;

 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';

-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;

 UPDATE b SET aa = 'new';

-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;

 UPDATE a SET aa = 'newtoo';

-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1589,12 +1589,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);

 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;

 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;

 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 
@@ -1653,8 +1653,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;

 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;

 drop table foo cascade;
 drop table bar cascade;
-- 
2.9.2



Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hello,

At Fri, 20 Oct 2017 17:37:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20171020.173707.12913619.horiguchi.kyotaro@lab.ntt.co.jp>
> The attached PoC patch theoretically has no impact on the normal
> code paths and just brings gain in async cases.

The just-committed Parallel Append conflicted with this, so attached
is the version rebased onto current HEAD. The results of a brief
performance test follow.

                           patched(ms)  unpatched(ms)   gain(%)
A: simple table scan     :  3562.32      3444.81        -3.4
B: local partitioning    :  1451.25      1604.38         9.5
C: single remote table   :  8818.92      9297.76         5.1
D: sharding (single con) :  5966.14      6646.73        10.2
E: sharding (multi con)  :  1802.25      6515.49        72.3

> A and B are degradation checks, which are expected to show no
> degradation.  C is the gain only by postgres_fdw's command
> presending on a remote table.  D is the gain of sharding on a
> connection. The number of partitions/shards is 4.  E is the gain
> using dedicate connection per shard.

Test A is accelerated by parallel sequential scan, and introducing
Parallel Append accelerates test B. Comparing A and B, I doubt that
any degradation is stably measurable, at least in my environment,
but I believe there is no degradation in theory. Tests C to E still
show a clear gain.
regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From b1aff3362b983975003d8a60f9b3593cb2fa62fc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

In some cases a WaitEventSet needs to be released via a resource owner.
This change adds a resource-owner field to WaitEventSet and allows the
creator of a WaitEventSet to specify the owning resource owner.
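
For illustration, a hypothetical caller (not part of this patch) could tie
a wait event set to the transaction's resource owner so that an error
thrown while waiting cannot leak it; PG_WAIT_EXTENSION is used here merely
as a generic wait-event label:

    #include "postgres.h"
    #include "pgstat.h"
    #include "storage/latch.h"
    #include "utils/memutils.h"
    #include "utils/resowner.h"

    /* Hypothetical caller: returns true when the socket became readable. */
    static bool
    wait_socket_readable(pgsocket sock, long timeout)
    {
        WaitEventSet *wes;
        WaitEvent    ev;
        int          nevents;

        wes = CreateWaitEventSet(TopTransactionContext,
                                 TopTransactionResourceOwner, 1);
        AddWaitEventToSet(wes, WL_SOCKET_READABLE, sock, NULL, NULL);

        /* If this throws, the resource owner releases the set for us. */
        nevents = WaitEventSetWait(wes, timeout, &ev, 1, PG_WAIT_EXTENSION);

        FreeWaitEventSet(wes);  /* normal path; also forgotten by the owner */
        return nevents > 0 && (ev.events & WL_SOCKET_READABLE) != 0;
    }

Passing NULL for the new ResourceOwner argument preserves the existing
behavior, as the callers updated by this patch do.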
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index fc15181..7c4077a 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
                 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
     AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
                       NULL, NULL);
     AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 4eb6e83..e6fc3dd 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
     int            nevents;        /* number of registered events */
     int            nevents_space;    /* maximum number of events in this set */
 
+    ResourceOwner    resowner;    /* Resource owner */
+
     /*
      * Array, of nevents_space length, storing the definition of events this
      * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
     int            ret = 0;
     int            rc;
     WaitEvent    event;
-    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
     if (wakeEvents & WL_TIMEOUT)
         Assert(timeout >= 0);
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
     WaitEventSet *set;
     char       *data;
     Size        sz = 0;
 
+    if (res)
+        ResourceOwnerEnlargeWESs(res);
+
     /*
      * Use MAXALIGN size/alignment to guarantee that later uses of memory are
      * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
     StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+    /* Register this wait event set if requested */
+    set->resowner = res;
+    if (res)
+        ResourceOwnerRememberWES(set->resowner, set);
+
     return set;
 }
 
@@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set)
     }
 #endif
 
+    if (set->resowner != NULL)
+        ResourceOwnerForgetWES(set->resowner, set);
+
     pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index b4b7d28..182f759 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     /* Create a reusable WaitEventSet. */
     if (cv_wait_event_set == NULL)
     {
-        cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+        cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
         AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
                           MyLatch, NULL);
     }
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 4c35ccf..e00e39c 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
     ResourceArray snapshotarr;    /* snapshot references */
     ResourceArray filearr;        /* open temporary files */
     ResourceArray dsmarr;        /* dynamic shmem segments */
+    ResourceArray wesarr;        /* wait event sets */
 
     /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
     int            nlocks;            /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
     ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
     ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
     ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+    ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
     return owner;
 }
@@ -538,6 +541,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
                 PrintDSMLeakWarning(res);
             dsm_detach(res);
         }
+
+        /* Ditto for wait event sets */
+        while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+        {
+            WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+            if (isCommit)
+                PrintWESLeakWarning(event);
+            FreeWaitEventSet(event);
+        }
     }
     else if (phase == RESOURCE_RELEASE_LOCKS)
     {
@@ -685,6 +698,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     Assert(owner->snapshotarr.nitems == 0);
     Assert(owner->filearr.nitems == 0);
     Assert(owner->dsmarr.nitems == 0);
+    Assert(owner->wesarr.nitems == 0);
     Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
     /*
@@ -711,6 +725,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     ResourceArrayFree(&(owner->snapshotarr));
     ResourceArrayFree(&(owner->filearr));
     ResourceArrayFree(&(owner->dsmarr));
+    ResourceArrayFree(&(owner->wesarr));
 
     pfree(owner);
 }
@@ -1253,3 +1268,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
     elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
          dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+    ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+    ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identifier of a wait event set,
+     * use its pointer instead.
+     */
+    if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+        elog(ERROR, "wait event set %p is not owned by resource owner %s",
+             events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identifier of a wait event set,
+     * use its pointer instead.
+     */
+    elog(WARNING, "wait event set leak: %p still referenced",
+         events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index a43193c..997ee8d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+                                        ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
                   Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 2420b65..70b0bb9 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
                        dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+                         WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+                       WaitEventSet *);
+
 #endif                            /* RESOWNER_PRIVATE_H */
-- 
2.9.2

From 9c1273a4868bed5eb0991f842296cb89c10470bc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:23:51 +0900
Subject: [PATCH 2/3] core side modification

---
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/execAsync.c        | 110 ++++++++++++++
 src/backend/executor/nodeAppend.c       | 247 +++++++++++++++++++++++++++-----
 src/backend/executor/nodeForeignscan.c  |  22 ++-
 src/backend/optimizer/plan/createplan.c |  62 +++++++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/include/executor/execAsync.h        |  23 +++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 ++
 src/include/nodes/execnodes.h           |  18 ++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 13 files changed, 462 insertions(+), 45 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895..8ad2adf 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execPartition.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..f7daed7
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,110 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *      Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *      src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+void
+ExecAsyncSetState(PlanState *pstate, AsyncState status)
+{
+    pstate->asyncstate = status;
+}
+
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                       void *data, bool reinit)
+{
+    switch (nodeTag(node))
+    {
+    case T_ForeignScanState:
+        return ExecForeignAsyncConfigureWait((ForeignScanState *) node,
+                                             wes, data, reinit);
+    default:
+        elog(ERROR, "unrecognized node type: %d",
+             (int) nodeTag(node));
+    }
+    return false;                /* keep compiler quiet */
+}
+
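+/* Maximum number of wait events processed per WaitEventSetWait() call */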
+#define EVENT_BUFFER_SIZE 16
+
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
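+    /*
+     * refind[] is what each wait event's user_data points at; on wakeup it
+     * translates a fired event back into the index of the node that
+     * registered it.  The array is kept across calls to avoid repeated
+     * allocation.
+     */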
+    static int *refind = NULL;
+    static int refindsize = 0;
+    WaitEventSet *wes;
+    WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+    int noccurred = 0;
+    Bitmapset *fired_events = NULL;
+    int i;
+    int n;
+
+    n = bms_num_members(waitnodes);
+    wes = CreateWaitEventSet(TopTransactionContext,
+                             TopTransactionResourceOwner, n);
+    if (refindsize < n)
+    {
+        if (refindsize == 0)
+            refindsize = EVENT_BUFFER_SIZE; /* XXX */
+        while (refindsize < n)
+            refindsize *= 2;
+        if (refind)
+            refind = (int *) repalloc(refind, refindsize * sizeof(int));
+        else
+            refind = (int *) palloc(refindsize * sizeof(int));
+    }
+
+    n = 0;
+    for (i = bms_next_member(waitnodes, -1); i >= 0;
+         i = bms_next_member(waitnodes, i))
+    {
+        refind[i] = i;
+        if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+            n++;
+    }
+
+    if (n == 0)
+    {
+        FreeWaitEventSet(wes);
+        return NULL;
+    }
+
+    noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+                                 EVENT_BUFFER_SIZE,
+                                 WAIT_EVENT_ASYNC_WAIT);
+    FreeWaitEventSet(wes);
+    if (noccurred == 0)
+        return NULL;
+
+    for (i = 0; i < noccurred; i++)
+    {
+        WaitEvent *w = &occurred_event[i];
+
+        if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+        {
+            int            nodeidx = *(int *) w->user_data;
+
+            fired_events = bms_add_member(fired_events, nodeidx);
+        }
+    }
+
+    return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 0e93713..f21ab36 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -59,6 +59,7 @@
 
 #include "executor/execdebug.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -79,6 +80,7 @@ struct ParallelAppendState
 #define INVALID_SUBPLAN_INDEX        -1
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -104,7 +106,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     ListCell   *lc;
 
     /* check for unsupported flags */
-    Assert(!(eflags & EXEC_FLAG_MARK));
+    Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
     /*
      * Lock the non-leaf tables in the partition tree controlled by this node.
@@ -127,6 +129,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     appendstate->ps.ExecProcNode = ExecAppend;
     appendstate->appendplans = appendplanstates;
     appendstate->as_nplans = nplans;
+    appendstate->as_nasyncplans = node->nasyncplans;
+    appendstate->as_syncdone = (node->nasyncplans == nplans);
+    appendstate->as_asyncresult = (TupleTableSlot **)
+        palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+    /* initially, every async subplan needs a request */
+    for (i = 0; i < appendstate->as_nasyncplans; ++i)
+        appendstate->as_needrequest =
+            bms_add_member(appendstate->as_needrequest, i);
 
     /*
      * Miscellaneous initialization
@@ -149,27 +164,48 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     foreach(lc, node->appendplans)
     {
         Plan       *initNode = (Plan *) lfirst(lc);
+        int            sub_eflags = eflags;
+
+        if (i < appendstate->as_nasyncplans)
+            sub_eflags |= EXEC_FLAG_ASYNC;
 
-        appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
+        appendplanstates[i] = ExecInitNode(initNode, estate, sub_eflags);
         i++;
     }
 
+    /* if there's any async-capable subnode, use async-aware routine */
+    if (appendstate->as_nasyncplans)
+        appendstate->ps.ExecProcNode = ExecAppendAsync;
+
     /*
      * initialize output tuple type
      */
     ExecAssignResultTypeFromTL(&appendstate->ps);
     appendstate->ps.ps_ProjInfo = NULL;
 
-    /*
-     * Parallel-aware append plans must choose the first subplan to execute by
-     * looking at shared memory, but non-parallel-aware append plans can
-     * always start with the first subplan.
-     */
-    appendstate->as_whichplan =
-        appendstate->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0;
+    if (appendstate->ps.plan->parallel_aware)
+    {
+        /*
+         * Parallel-aware append plans must choose the first subplan to
+         * execute by looking at shared memory.
+         */
+        appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;

-    /* If parallel-aware, this will be overridden later. */
-    appendstate->choose_next_subplan = choose_next_subplan_locally;
+
+        /* This will be overridden by the parallel machinery later. */
+        appendstate->choose_next_subplan = choose_next_subplan_locally;
+    }
+    else
+    {
+        /*
+         * Start scanning at the first synchronous subplan; the async
+         * subplans all precede it in the appendplans array.
+         */
+        appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
+        appendstate->choose_next_subplan = choose_next_subplan_locally;
+    }
 
     return appendstate;
 }
@@ -186,10 +222,12 @@ ExecAppend(PlanState *pstate)
     AppendState *node = castNode(AppendState, pstate);
 
     /* If no subplan has been chosen, we must choose one before proceeding. */
-    if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+    if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
         !node->choose_next_subplan(node))
         return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 
+    Assert(node->as_nasyncplans == 0);
+
     for (;;)
     {
         PlanState  *subnode;
@@ -200,8 +238,9 @@ ExecAppend(PlanState *pstate)
         /*
          * figure out which subplan we are currently processing
          */
-        Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-        subnode = node->appendplans[node->as_whichplan];
+        Assert(node->as_whichsyncplan >= 0 &&
+               node->as_whichsyncplan < node->as_nplans);
+        subnode = node->appendplans[node->as_whichsyncplan];
 
         /*
          * get a tuple from the subplan
@@ -224,6 +263,137 @@ ExecAppend(PlanState *pstate)
     }
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+    AppendState *node = castNode(AppendState, pstate);
+    Bitmapset *needrequest;
+    int    i;
+
+    Assert(node->as_nasyncplans > 0);
+
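+    /* Return a result stashed in a previous round, if any remains */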
+    if (node->as_nasyncresult > 0)
+    {
+        --node->as_nasyncresult;
+        return node->as_asyncresult[node->as_nasyncresult];
+    }
+
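+    /*
+     * Issue a request to every async subplan that needs one.  A subplan that
+     * can return a tuple immediately is queued for another request; one that
+     * must wait is moved to the pending list.
+     */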
+    needrequest = node->as_needrequest;
+    node->as_needrequest = NULL;
+    while ((i = bms_first_member(needrequest)) >= 0)
+    {
+        TupleTableSlot *slot;
+        PlanState *subnode = node->appendplans[i];
+
+        slot = ExecProcNode(subnode);
+        if (subnode->asyncstate == AS_AVAILABLE)
+        {
+            if (!TupIsNull(slot))
+            {
+                node->as_asyncresult[node->as_nasyncresult++] = slot;
+                node->as_needrequest = bms_add_member(node->as_needrequest, i);
+            }
+        }
+        else
+            node->as_pending_async = bms_add_member(node->as_pending_async, i);
+    }
+    bms_free(needrequest);
+
+    for (;;)
+    {
+        TupleTableSlot *result;
+
+        /* return now if a result is available */
+        if (node->as_nasyncresult > 0)
+        {
+            --node->as_nasyncresult;
+            return node->as_asyncresult[node->as_nasyncresult];
+        }
+
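+        /*
+         * Collect results from the pending async subplans.  Block only when
+         * the synchronous subplans are exhausted; otherwise just poll
+         * (timeout 0) so the synchronous children can make progress.
+         */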
+        while (!bms_is_empty(node->as_pending_async))
+        {
+            long timeout = node->as_syncdone ? -1 : 0;
+            Bitmapset *fired;
+            int i;
+
+            fired = ExecAsyncEventWait(node->appendplans, node->as_pending_async,
+                                       timeout);
+            while ((i = bms_first_member(fired)) >= 0)
+            {
+                TupleTableSlot *slot;
+                PlanState *subnode = node->appendplans[i];
+                slot = ExecProcNode(subnode);
+                if (subnode->asyncstate == AS_AVAILABLE)
+                {
+                    if (!TupIsNull(slot))
+                    {
+                        node->as_asyncresult[node->as_nasyncresult++] = slot;
+                        node->as_needrequest =
+                            bms_add_member(node->as_needrequest, i);
+                    }
+                    node->as_pending_async =
+                        bms_del_member(node->as_pending_async, i);
+                }
+            }
+            bms_free(fired);
+
+            /* return now if a result is available */
+            if (node->as_nasyncresult > 0)
+            {
+                --node->as_nasyncresult;
+                return node->as_asyncresult[node->as_nasyncresult];
+            }
+
+            if (!node->as_syncdone)
+                break;
+        }
+
+        /*
+         * If there is no asynchronous activity still pending and the
+         * synchronous activity is also complete, we're totally done scanning
+         * this node.  Otherwise, we're done with the asynchronous stuff but
+         * must continue scanning the synchronous children.
+         */
+        if (node->as_syncdone)
+        {
+            Assert(bms_is_empty(node->as_pending_async));
+            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+        }
+
+        /*
+         * get a tuple from the subplan
+         */
+        result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+        if (!TupIsNull(result))
+        {
+            /*
+             * If the subplan gave us something then return it as-is. We do
+             * NOT make use of the result slot that was set up in
+             * ExecInitAppend; there's no need for it.
+             */
+            return result;
+        }
+
+        /*
+         * Go on to the "next" subplan in the appropriate direction. If no
+         * more subplans, return the empty slot set up for us by
+         * ExecInitAppend, unless there are async plans we have yet to finish.
+         */
+        if (!node->choose_next_subplan(node))
+        {
+            node->as_syncdone = true;
+            if (bms_is_empty(node->as_pending_async))
+            {
+                Assert(bms_is_empty(node->as_needrequest));
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+            }
+        }
+
+        /* Else loop back and try to get a tuple from the new subplan */
+    }
+}
+
 /* ----------------------------------------------------------------
  *        ExecEndAppend
  *
@@ -257,6 +427,15 @@ ExecReScanAppend(AppendState *node)
 {
     int            i;
 
+    /*
+     * Reset async state.  Each async subplan is shut down first so that a
+     * request still in flight does not survive into the rescan.
+     */
+    for (i = 0; i < node->as_nasyncplans; ++i)
+    {
+        ExecShutdownNode(node->appendplans[i]);
+        node->as_needrequest = bms_add_member(node->as_needrequest, i);
+    }
+    node->as_nasyncresult = 0;
+    node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
     for (i = 0; i < node->as_nplans; i++)
     {
         PlanState  *subnode = node->appendplans[i];
@@ -276,7 +455,7 @@ ExecReScanAppend(AppendState *node)
             ExecReScan(subnode);
     }
 
-    node->as_whichplan =
+    node->as_whichsyncplan =
         node->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0;
 }
 
@@ -365,7 +544,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-    int            whichplan = node->as_whichplan;
+    int            whichplan = node->as_whichsyncplan;
 
     /* We should never see INVALID_SUBPLAN_INDEX in this case. */
     Assert(whichplan >= 0 && whichplan <= node->as_nplans);
@@ -374,13 +553,13 @@ choose_next_subplan_locally(AppendState *node)
     {
         if (whichplan >= node->as_nplans - 1)
             return false;
-        node->as_whichplan++;
+        node->as_whichsyncplan++;
     }
     else
     {
         if (whichplan <= 0)
             return false;
-        node->as_whichplan--;
+        node->as_whichsyncplan--;
     }
 
     return true;
@@ -405,33 +584,33 @@ choose_next_subplan_for_leader(AppendState *node)
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
     {
         /* Mark just-completed subplan as finished. */
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
     }
     else
     {
         /* Start with last subplan. */
-        node->as_whichplan = node->as_nplans - 1;
+        node->as_whichsyncplan = node->as_nplans - 1;
     }
 
     /* Loop until we find a subplan to execute. */
-    while (pstate->pa_finished[node->as_whichplan])
+    while (pstate->pa_finished[node->as_whichsyncplan])
     {
-        if (node->as_whichplan == 0)
+        if (node->as_whichsyncplan == 0)
         {
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-            node->as_whichplan = INVALID_SUBPLAN_INDEX;
+            node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
             LWLockRelease(&pstate->pa_lock);
             return false;
         }
-        node->as_whichplan--;
+        node->as_whichsyncplan--;
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < append->first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < append->first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -464,8 +643,8 @@ choose_next_subplan_for_worker(AppendState *node)
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
     /* Mark just-completed subplan as finished. */
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     /* If all the plans are already done, we have nothing to do */
     if (pstate->pa_next_plan == INVALID_SUBPLAN_INDEX)
@@ -490,10 +669,10 @@ choose_next_subplan_for_worker(AppendState *node)
         else
         {
             /* At last plan, no partial plans, arrange to bail out. */
-            pstate->pa_next_plan = node->as_whichplan;
+            pstate->pa_next_plan = node->as_whichsyncplan;
         }
 
-        if (pstate->pa_next_plan == node->as_whichplan)
+        if (pstate->pa_next_plan == node->as_whichsyncplan)
         {
             /* We've tried everything! */
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -503,7 +682,7 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Pick the plan we found, and advance pa_next_plan one more time. */
-    node->as_whichplan = pstate->pa_next_plan++;
+    node->as_whichsyncplan = pstate->pa_next_plan++;
     if (pstate->pa_next_plan >= node->as_nplans)
     {
         if (append->first_partial_plan < node->as_nplans)
@@ -519,8 +698,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < append->first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < append->first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index dc6cfcf..afc8a58 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate)
                     (ExecScanRecheckMtd) ForeignRecheck);
 }
 
-
 /* ----------------------------------------------------------------
  *        ExecInitForeignScan
  * ----------------------------------------------------------------
@@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
     scanstate->ss.ps.plan = (Plan *) node;
     scanstate->ss.ps.state = estate;
     scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+    scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
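+    /* Remember whether the parent has requested asynchronous execution */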
+    if ((eflags & EXEC_FLAG_ASYNC) != 0)
+        scanstate->fs_async = true;
 
     /*
      * Miscellaneous initialization
@@ -389,3 +392,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
     if (fdwroutine->ShutdownForeignScan)
         fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *        ExecForeignAsyncConfigureWait
+ *
+ *        In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                              void *caller_data, bool reinit)
+{
+    FdwRoutine *fdwroutine = node->fdwroutine;
+
+    Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+    return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+                                                 caller_data, reinit);
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index f6c83d0..402db1e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -204,7 +204,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
                    Index scanrelid, int wtParam);
 static Append *make_append(List *appendplans, int first_partial_plan,
-            List *tlist, List *partitioned_rels);
+                           int nasyncplans, int referent,
+                           List *tlist, List *partitioned_rels);
 static RecursiveUnion *make_recursive_union(List *tlist,
                      Plan *lefttree,
                      Plan *righttree,
@@ -284,6 +285,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
                  List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
                          GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -1014,8 +1016,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
     Append       *plan;
     List       *tlist = build_path_tlist(root, &best_path->path);
-    List       *subplans = NIL;
+    List       *asyncplans = NIL;
+    List       *syncplans = NIL;
     ListCell   *subpaths;
+    int            nasyncplans = 0;
+    bool        first = true;
+    bool        referent_is_sync = true;
 
     /*
      * The subpaths list could be empty, if every child was proven empty by
@@ -1050,7 +1056,21 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
         /* Must insist that all children return the same tlist */
         subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-        subplans = lappend(subplans, subplan);
+        /*
+         * Classify the subplan as async-capable or not.  If we have decided
+         * to run the children in parallel, none of them may run
+         * asynchronously.
+         */
+        if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+        {
+            asyncplans = lappend(asyncplans, subplan);
+            ++nasyncplans;
+            if (first)
+                referent_is_sync = false;
+        }
+        else
+            syncplans = lappend(syncplans, subplan);
+
+        first = false;
     }
 
     /*
@@ -1060,8 +1080,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
      * parent-rel Vars it'll be asked to emit.
      */
 
-    plan = make_append(subplans, best_path->first_partial_path,
-                       tlist, best_path->partitioned_rels);
+    plan = make_append(list_concat(asyncplans, syncplans),
+                       best_path->first_partial_path, nasyncplans,
+                       referent_is_sync ? nasyncplans : 0, tlist,
+                       best_path->partitioned_rels);
 
     copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5296,8 +5318,8 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int first_partial_plan,
-            List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int first_partial_plan, int nasyncplans,
+            int referent, List *tlist, List *partitioned_rels)
 {
     Append       *node = makeNode(Append);
     Plan       *plan = &node->plan;
@@ -5309,6 +5331,8 @@ make_append(List *appendplans, int first_partial_plan,
     node->partitioned_rels = partitioned_rels;
     node->appendplans = appendplans;
     node->first_partial_plan = first_partial_plan;
+    node->nasyncplans = nasyncplans;
+    node->referent = referent;
 
     return node;
 }
@@ -6646,3 +6670,27 @@ is_projection_capable_plan(Plan *plan)
     }
     return true;
 }
+
+/*
+ * is_async_capable_path
+ *        Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+    switch (nodeTag(path))
+    {
+        case T_ForeignPath:
+            {
+                FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+                Assert(fdwroutine != NULL);
+                if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+                    fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+                    return true;
+            }
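+            /* not async-capable; FALLTHROUGH to the default case */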
+        default:
+            break;
+    }
+    return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5c256ff..09ea33b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3628,6 +3628,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_SYNC_REP:
             event_name = "SyncRep";
             break;
+        case WAIT_EVENT_ASYNC_WAIT:
+            event_name = "AsyncExecWait";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..5fd67d9
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *        Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *        src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern void ExecAsyncSetState(PlanState *pstate, AsyncState status);
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                                   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+                                     long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index b5578f5..bd622c9 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -63,6 +63,7 @@
 #define EXEC_FLAG_WITH_OIDS        0x0020    /* force OIDs in returned tuples */
 #define EXEC_FLAG_WITHOUT_OIDS    0x0040    /* force no OIDs in returned tuples */
 #define EXEC_FLAG_WITH_NO_DATA    0x0080    /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC            0x0100    /* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 152abf0..1d95e39 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
                                 ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+                                          WaitEventSet *wes,
+                                          void *caller_data, bool reinit);
 
 #endif                            /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 04e43cc..566236b 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -161,6 +161,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
                                                             List *fdw_private,
                                                             RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+                                                    WaitEventSet *wes,
+                                                    void *caller_data,
+                                                    bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -182,6 +187,7 @@ typedef struct FdwRoutine
     GetForeignPlan_function GetForeignPlan;
     BeginForeignScan_function BeginForeignScan;
     IterateForeignScan_function IterateForeignScan;
+    IterateForeignScan_function IterateForeignScanAsync;
     ReScanForeignScan_function ReScanForeignScan;
     EndForeignScan_function EndForeignScan;
 
@@ -232,6 +238,11 @@ typedef struct FdwRoutine
     InitializeDSMForeignScan_function InitializeDSMForeignScan;
     ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
     InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+    /* Support functions for asynchronous execution */
+    IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+    ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
     ShutdownForeignScan_function ShutdownForeignScan;
 
     /* Support functions for path reparameterization. */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1a35c5c..c049251 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -843,6 +843,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+    AS_AVAILABLE,
+    AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
     NodeTag        type;
@@ -883,6 +889,9 @@ typedef struct PlanState
     TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
     ExprContext *ps_ExprContext;    /* node's expression-evaluation context */
     ProjectionInfo *ps_ProjInfo;    /* info for doing tuple projection */
+
+    AsyncState    asyncstate;
+    int32        padding;            /* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1012,10 +1021,16 @@ struct AppendState
     PlanState    ps;                /* its first field is NodeTag */
     PlanState **appendplans;    /* array of PlanStates for my inputs */
     int            as_nplans;
-    int            as_whichplan;
+    int            as_nasyncplans;    /* # of async-capable children */
     ParallelAppendState *as_pstate; /* parallel coordination info */
+    int            as_whichsyncplan; /* which sync plan is being executed  */
     Size        pstate_len;        /* size of parallel coordination info */
     bool        (*choose_next_subplan) (AppendState *);
+    bool        as_syncdone;    /* all synchronous plans done? */
+    Bitmapset  *as_needrequest;    /* async plans needing a new request */
+    Bitmapset  *as_pending_async;    /* pending async plans */
+    TupleTableSlot **as_asyncresult;    /* unreturned results of async plans */
+    int            as_nasyncresult;    /* # of valid entries in as_asyncresult */
 };
 
 /* ----------------
@@ -1566,6 +1581,7 @@ typedef struct ForeignScanState
     Size        pscan_len;        /* size of parallel coordination information */
     /* use struct pointer to avoid including fdwapi.h here */
     struct FdwRoutine *fdwroutine;
+    bool        fs_async;
     void       *fdw_state;        /* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 02fb366..a6df261 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -249,6 +249,8 @@ typedef struct Append
     List       *partitioned_rels;
     List       *appendplans;
     int            first_partial_plan;
+    int            nasyncplans;    /* # of async plans, always at start of list */
+    int            referent;         /* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..fe9d39c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -816,7 +816,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_ORIGIN_DROP,
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
+    WAIT_EVENT_SYNC_REP,
+    WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2

From d0882fbc09fce447e642278292b70e4a6b73575e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c              |  26 ++
 contrib/postgres_fdw/expected/postgres_fdw.out | 128 ++++---
 contrib/postgres_fdw/postgres_fdw.c            | 484 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 522 insertions(+), 138 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 4fbf043..646085f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
     bool        invalidated;    /* true if reconnect is pending */
     uint32        server_hashvalue;    /* hash value of foreign server OID */
     uint32        mapping_hashvalue;    /* hash value of user mapping OID */
+    void        *storage;        /* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
         elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
              entry->conn, server->servername, user->umid, user->userid);
+        entry->storage = NULL;
     }
 
     /*
@@ -216,6 +218,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user, allocating it with
+ * initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+    bool        found;
+    ConnCacheEntry *entry;
+    ConnCacheKey key;
+
+    key = user->umid;
+    entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
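+    /* the entry should have been created by a preceding GetConnection() */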
+    Assert(found);
+
+    if (entry->storage == NULL)
+    {
+        entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+        memset(entry->storage, 0, initsize);
+    }
+
+    return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 683d641..3b4eefa 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6514,7 +6514,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -6542,7 +6542,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6570,7 +6570,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6598,7 +6598,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -6664,35 +6664,40 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6702,35 +6707,40 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -6760,11 +6770,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -6778,11 +6788,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (39 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6813,16 +6823,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -6840,16 +6850,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -7000,27 +7010,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fb65e2e..0688504 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+    ForeignScanState *current_owner;    /* The node currently running a query
+                                         * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+    PGconn       *conn;            /* connection for the scan */
+    PgFdwConnpriv *connpriv;    /* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table. NULL
                                  * for a foreign join scan. */
     TupleDesc    tupdesc;        /* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
     List       *retrieved_attrs;    /* list of retrieved attribute numbers */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
+    bool        result_ready;    /* true if a result is currently available */
     unsigned int cursor_number; /* quasi-unique ID for my cursor */
     bool        cursor_exists;    /* have we created the cursor? */
     int            numParams;        /* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
     /* batch-level state, for optimizing rewinds and avoiding useless fetch */
     int            fetch_ct_2;        /* Min(# of fetches done, 2) */
     bool        eof_reached;    /* true if last fetch reached EOF */
+    bool        run_async;        /* true if run asynchronously */
+    bool        async_waiting;    /* true if requesting the parent to wait */
+    ForeignScanState *waiter;    /* Next node to run a query among nodes
+                                 * sharing the same connection */
+    ForeignScanState *last_waiter;    /* Last node in the waiting list.
+                                     * Maintained only by the current
+                                     * owner of the connection */
 
     /* working memory contexts */
     MemoryContext batch_cxt;    /* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     char       *p_name;            /* name of prepared statement, if created */
 
     /* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
@@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
                                 RangeTblEntry *target_rte,
                                 Relation target_relation);
@@ -348,6 +380,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
                              UpperRelationKind stage,
                              RelOptInfo *input_rel,
                              RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+                                              WaitEventSet *wes,
+                                              void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -368,7 +404,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
                           EquivalenceClass *ec, EquivalenceMember *em,
                           void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -438,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     routine->IterateForeignScan = postgresIterateForeignScan;
     routine->ReScanForeignScan = postgresReScanForeignScan;
     routine->EndForeignScan = postgresEndForeignScan;
+    routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
     /* Functions for updating foreign tables */
     routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -472,6 +512,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     /* Support functions for upper relation push-down */
     routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+    /* Support functions for async execution */
+    routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+    routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
     PG_RETURN_POINTER(routine);
 }
 
@@ -1322,12 +1366,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    fsstate->conn = GetConnection(user, false);
+    fsstate->s.conn = GetConnection(user, false);
+    fsstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+    fsstate->s.connpriv->current_owner = NULL;
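+    /* no one is waiting yet; last_waiter points to this node itself */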
+    fsstate->waiter = NULL;
+    fsstate->last_waiter = node;
 
     /* Assign a unique ID for my cursor */
-    fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+    fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
     fsstate->cursor_exists = false;
 
+    /* Initialize async execution status */
+    fsstate->run_async = false;
+    fsstate->async_waiting = false;
+
     /* Get private info created by planner functions. */
     fsstate->query = strVal(list_nth(fsplan->fdw_private,
                                      FdwScanPrivateSelectSql));
@@ -1383,32 +1436,136 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
     /*
-     * If this is the first call after Begin or ReScan, we need to create the
-     * cursor on the remote side.
-     */
-    if (!fsstate->cursor_exists)
-        create_cursor(node);
-
-    /*
      * Get some more tuples, if we've run out.
      */
     if (fsstate->next_tuple >= fsstate->num_tuples)
     {
-        /* No point in another fetch if we already detected EOF, though. */
-        if (!fsstate->eof_reached)
-            fetch_more_data(node);
-        /* If we didn't get any tuples, must be end of data. */
+        ForeignScanState *next_conn_owner = node;
+
+        /* This node has sent a query on this connection */
+        if (fsstate->s.connpriv->current_owner == node)
+        {
+            /* Check if the result is available */
+            if (PQisBusy(fsstate->s.conn))
+            {
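+                /* Poll the socket; the zero timeout returns immediately */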
+                int rc = WaitLatchOrSocket(NULL,
+                                           WL_SOCKET_READABLE | WL_TIMEOUT,
+                                           PQsocket(fsstate->s.conn), 0,
+                                           WAIT_EVENT_ASYNC_WAIT);
+                if (node->fs_async && !(rc & WL_SOCKET_READABLE))
+                {
+                    /*
+                     * This node is not ready yet. Tell the caller to wait.
+                     */
+                    fsstate->result_ready = false;
+                    node->ss.ps.asyncstate = AS_WAITING;
+                    return ExecClearTuple(slot);
+                }
+            }
+
+            Assert(fsstate->async_waiting);
+            fsstate->async_waiting = false;
+            fetch_received_data(node);
+
+            /*
+             * If any other node is waiting on this connection, let the
+             * first waiter become its next owner.
+             */
+            if (fsstate->waiter)
+            {
+                PgFdwScanState *next_owner_state;
+
+                next_conn_owner = fsstate->waiter;
+                next_owner_state = GetPgFdwScanState(next_conn_owner);
+                fsstate->waiter = NULL;
+
+                /*
+                 * Only the current owner is responsible for maintaining the
+                 * shortcut to the last waiter.
+                 */
+                next_owner_state->last_waiter = fsstate->last_waiter;
+
+                /*
+                 * For simplicity, last_waiter points to the node itself when
+                 * no one is waiting for it.
+                 */
+                fsstate->last_waiter = node;
+            }
+        }
+        else if (fsstate->s.connpriv->current_owner &&
+                 !GetPgFdwScanState(node)->eof_reached)
+        {
+            /*
+             * Someone else is holding this connection and we want this node
+             * to run later. Add myself to the tail of the waiters' list, then
+             * return not-ready.  To avoid scanning through the waiters' list,
+             * the current owner maintains a shortcut to the last waiter.
+             */
+            PgFdwScanState *conn_owner_state =
+                GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+            ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+            PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+            last_waiter_state->waiter = node;
+            conn_owner_state->last_waiter = node;
+
+            /* Register the node in the async-waiting node list */
+            Assert(!GetPgFdwScanState(node)->async_waiting);
+
+            GetPgFdwScanState(node)->async_waiting = true;
+
+            fsstate->result_ready = fsstate->eof_reached;
+            node->ss.ps.asyncstate =
+                fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
+            return ExecClearTuple(slot);
+        }
+
+        /* At this point no query is in flight on the connection */
+        Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+               == NULL);
+        /*
+         * Send the next request on behalf of the next owner of this
+         * connection, if needed.
+         */
+        if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+        {
+            PgFdwScanState *next_owner_state =
+                GetPgFdwScanState(next_conn_owner);
+
+            request_more_data(next_conn_owner);
+
+            /* Register the node in the async-waiting node list */
+            if (!next_owner_state->async_waiting)
+                next_owner_state->async_waiting = true;
+
+            if (!next_conn_owner->fs_async)
+                fetch_received_data(next_conn_owner);
+        }
+
+        /*
+         * If we haven't received a result for the given node this time,
+         * return with no tuple to give way to other nodes.
+         */
         if (fsstate->next_tuple >= fsstate->num_tuples)
+        {
+            fsstate->result_ready = fsstate->eof_reached;
+            node->ss.ps.asyncstate =
+                fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
             return ExecClearTuple(slot);
+        }
     }
 
     /*
      * Return the next tuple.
      */
+    fsstate->result_ready = true;
+    node->ss.ps.asyncstate = AS_AVAILABLE;
     ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
                    slot,
                    InvalidBuffer,
@@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     char        sql[64];
     PGresult   *res;
 
@@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node)
     if (!fsstate->cursor_exists)
         return;
 
+    /* Absorb the remaining result */
+    absorb_current_result(node);
+
     /*
      * If any internal parameters affecting this node have changed, we'd
      * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_exec_query(fsstate->conn, sql);
+    res = pgfdw_exec_query(fsstate->s.conn, sql);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+        pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
     PQclear(res);
 
     /* Now force a fresh FETCH. */
@@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
     /* if fsstate is NULL, we are in EXPLAIN; nothing to do */
     if (fsstate == NULL)
@@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
     /* Close the cursor if open, to prevent accumulation of cursors */
     if (fsstate->cursor_exists)
-        close_cursor(fsstate->conn, fsstate->cursor_number);
+        close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
     /* Release remote connection */
-    ReleaseConnection(fsstate->conn);
-    fsstate->conn = NULL;
+    ReleaseConnection(fsstate->s.conn);
+    fsstate->s.conn = NULL;
 
     /* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *        Stop async execution and clean up leftover results on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+    ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+    if (plan->operation != CMD_SELECT)
+        return;
+
+    /* Absorb the remaining result */
+    absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *        Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1700,7 +1876,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
     user = GetUserMapping(userid, table->serverid);
 
     /* Open connection; report that we'll create a prepared statement. */
-    fmstate->conn = GetConnection(user, true);
+    fmstate->s.conn = GetConnection(user, true);
+    fmstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
     fmstate->p_name = NULL;        /* prepared statement not made yet */
 
     /* Deconstruct fdw_private data. */
@@ -1779,6 +1957,8 @@ postgresExecForeignInsert(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    vacate_connection((PgFdwState *)fmstate);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1789,14 +1969,14 @@ postgresExecForeignInsert(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1804,10 +1984,10 @@ postgresExecForeignInsert(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -1845,6 +2025,8 @@ postgresExecForeignUpdate(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    vacate_connection((PgFdwState *)fmstate);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1865,14 +2047,14 @@ postgresExecForeignUpdate(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1880,10 +2062,10 @@ postgresExecForeignUpdate(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -1921,6 +2103,8 @@ postgresExecForeignDelete(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    vacate_connection((PgFdwState *)fmstate);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1941,14 +2125,14 @@ postgresExecForeignDelete(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1956,10 +2140,10 @@ postgresExecForeignDelete(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -2006,16 +2190,16 @@ postgresEndForeignModify(EState *estate,
          * We don't use a PG_TRY block here, so be careful not to throw error
          * without releasing the PGresult.
          */
-        res = pgfdw_exec_query(fmstate->conn, sql);
+        res = pgfdw_exec_query(fmstate->s.conn, sql);
         if (PQresultStatus(res) != PGRES_COMMAND_OK)
-            pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+            pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
         PQclear(res);
         fmstate->p_name = NULL;
     }
 
     /* Release remote connection */
-    ReleaseConnection(fmstate->conn);
-    fmstate->conn = NULL;
+    ReleaseConnection(fmstate->s.conn);
+    fmstate->s.conn = NULL;
 }
 
 /*
@@ -2303,7 +2487,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    dmstate->conn = GetConnection(user, false);
+    dmstate->s.conn = GetConnection(user, false);
+    dmstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
     /* Initialize state variable */
     dmstate->num_tuples = -1;    /* -1 means not set yet */
@@ -2356,7 +2542,10 @@ postgresIterateDirectModify(ForeignScanState *node)
      * If this is the first call after Begin, execute the statement.
      */
     if (dmstate->num_tuples == -1)
+    {
+        vacate_connection((PgFdwState *)dmstate);
         execute_dml_stmt(node);
+    }
 
     /*
      * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2403,8 +2592,8 @@ postgresEndDirectModify(ForeignScanState *node)
         PQclear(dmstate->result);
 
     /* Release remote connection */
-    ReleaseConnection(dmstate->conn);
-    dmstate->conn = NULL;
+    ReleaseConnection(dmstate->s.conn);
+    dmstate->s.conn = NULL;
 
     /* MemoryContext will be deleted automatically. */
 }
@@ -2523,6 +2712,7 @@ estimate_path_cost_size(PlannerInfo *root,
         List       *local_param_join_conds;
         StringInfoData sql;
         PGconn       *conn;
+        PgFdwConnpriv *connpriv;
         Selectivity local_sel;
         QualCost    local_cost;
         List       *fdw_scan_tlist = NIL;
@@ -2565,6 +2755,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
         /* Get the remote estimate */
         conn = GetConnection(fpinfo->user, false);
+        connpriv = GetConnectionSpecificStorage(fpinfo->user,
+                                                sizeof(PgFdwConnpriv));
+        if (connpriv)
+        {
+            PgFdwState tmpstate;
+            tmpstate.conn = conn;
+            tmpstate.connpriv = connpriv;
+            vacate_connection(&tmpstate);
+        }
+
         get_remote_estimate(sql.data, conn, &rows, &width,
                             &startup_cost, &total_cost);
         ReleaseConnection(conn);
@@ -2919,11 +3119,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     ExprContext *econtext = node->ss.ps.ps_ExprContext;
     int            numParams = fsstate->numParams;
     const char **values = fsstate->param_values;
-    PGconn       *conn = fsstate->conn;
+    PGconn       *conn = fsstate->s.conn;
     StringInfoData buf;
     PGresult   *res;
 
@@ -2989,47 +3189,96 @@ create_cursor(ForeignScanState *node)
- * Fetch some more rows from the node's cursor.
+ * Request some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    PGconn       *conn = fsstate->s.conn;
+    char        sql[64];
+
+    /* The connection should be vacant */
+    Assert(fsstate->s.connpriv->current_owner == NULL);
+
+    /*
+     * If this is the first call after Begin or ReScan, we need to create the
+     * cursor on the remote side.
+     */
+    if (!fsstate->cursor_exists)
+        create_cursor(node);
+
+    snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+             fsstate->fetch_size, fsstate->cursor_number);
+
+    if (!PQsendQuery(conn, sql))
+        pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+    fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the rows previously requested on the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     PGresult   *volatile res = NULL;
     MemoryContext oldcontext;
 
+    /* I should be the current connection owner */
+    Assert(fsstate->s.connpriv->current_owner == node);
+
     /*
      * We'll store the tuples in the batch_cxt.  First, flush the previous
-     * batch.
+     * batch if no tuples remain.
      */
-    fsstate->tuples = NULL;
-    MemoryContextReset(fsstate->batch_cxt);
+    if (fsstate->next_tuple >= fsstate->num_tuples)
+    {
+        fsstate->tuples = NULL;
+        fsstate->num_tuples = 0;
+        MemoryContextReset(fsstate->batch_cxt);
+    }
+    else if (fsstate->next_tuple > 0)
+    {
+        /* move the remaining tuples to the beginning of the store */
+        int n = 0;
+
+        while (fsstate->next_tuple < fsstate->num_tuples)
+            fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+        fsstate->num_tuples = n;
+    }
+
     oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
     /* PGresult must be released before leaving this function. */
     PG_TRY();
     {
-        PGconn       *conn = fsstate->conn;
+        PGconn       *conn = fsstate->s.conn;
         char        sql[64];
-        int            numrows;
+        int            addrows;
+        size_t        newsize;
         int            i;
 
         snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
                  fsstate->fetch_size, fsstate->cursor_number);
 
-        res = pgfdw_exec_query(conn, sql);
+        res = pgfdw_get_result(conn, sql);
         /* On error, report the original query, not the FETCH. */
         if (PQresultStatus(res) != PGRES_TUPLES_OK)
             pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
         /* Convert the data into HeapTuples */
-        numrows = PQntuples(res);
-        fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-        fsstate->num_tuples = numrows;
-        fsstate->next_tuple = 0;
+        addrows = PQntuples(res);
+        newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+        if (fsstate->tuples)
+            fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+        else
+            fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-        for (i = 0; i < numrows; i++)
+        for (i = 0; i < addrows; i++)
         {
             Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-            fsstate->tuples[i] =
+            fsstate->tuples[fsstate->num_tuples + i] =
                 make_tuple_from_result_row(res, i,
                                            fsstate->rel,
                                            fsstate->attinmeta,
@@ -3039,27 +3288,82 @@ fetch_more_data(ForeignScanState *node)
         }
 
         /* Update fetch_ct_2 */
-        if (fsstate->fetch_ct_2 < 2)
+        if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
             fsstate->fetch_ct_2++;
 
+        fsstate->next_tuple = 0;
+        fsstate->num_tuples += addrows;
+
         /* Must be EOF if we didn't get as many tuples as we asked for. */
-        fsstate->eof_reached = (numrows < fsstate->fetch_size);
+        fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
         PQclear(res);
         res = NULL;
     }
     PG_CATCH();
     {
+        fsstate->s.connpriv->current_owner = NULL;
         if (res)
             PQclear(res);
         PG_RE_THROW();
     }
     PG_END_TRY();
 
+    fsstate->s.connpriv->current_owner = NULL;
+
     MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+    PgFdwConnpriv *connpriv = fdwstate->connpriv;
+    ForeignScanState *owner;
+
+    if (connpriv == NULL || connpriv->current_owner == NULL)
+        return;
+
+    /*
+     * Let the current connection owner read the result of the running query.
+     */
+    owner = connpriv->current_owner;
+    fetch_received_data(owner);
+
+    /* Clear the waiting list */
+    while (owner)
+    {
+        PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+        fsstate->last_waiter = NULL;
+        owner = fsstate->waiter;
+        fsstate->waiter = NULL;
+    }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+    if (owner)
+    {
+        PgFdwScanState *target_state = GetPgFdwScanState(owner);
+        PGconn *conn = target_state->s.conn;
+
+        while (PQisBusy(conn))
+            PQclear(PQgetResult(conn));
+        fsstate->s.connpriv->current_owner = NULL;
+        fsstate->async_waiting = false;
+    }
+}
+
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3143,7 +3447,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
     /* Construct name we'll use for the prepared statement. */
     snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-             GetPrepStmtNumber(fmstate->conn));
+             GetPrepStmtNumber(fmstate->s.conn));
     p_name = pstrdup(prep_name);
 
     /*
@@ -3153,12 +3457,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * the prepared statements we use in this module are simple enough that
      * the remote server will make the right choices.
      */
-    if (!PQsendPrepare(fmstate->conn,
+    if (!PQsendPrepare(fmstate->s.conn,
                        p_name,
                        fmstate->query,
                        0,
                        NULL))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3166,9 +3470,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
     PQclear(res);
 
     /* This action shows that the prepare has been done. */
@@ -3299,9 +3603,9 @@ execute_dml_stmt(ForeignScanState *node)
      * the desired result.  This allows us to avoid assuming that the remote
      * server has the same OIDs we do for the parameters' types.
      */
-    if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+    if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
                            NULL, values, NULL, NULL, 0))
-        pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+        pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3309,10 +3613,10 @@ execute_dml_stmt(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+    dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
     if (PQresultStatus(dmstate->result) !=
         (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+        pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
                            dmstate->query);
 
     /* Get the number of rows affected. */
@@ -4582,6 +4886,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
     /* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+    return true;
+}
+
+/*
+ * Configure a wait event.
+ *
+ * Add a wait event only when the node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                                  void *caller_data, bool reinit)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+    /* If the caller didn't reinit, this event is already in the event set */
+    if (!reinit)
+        return true;
+
+    if (fsstate->s.connpriv->current_owner == node)
+    {
+        AddWaitEventToSet(wes,
+                          WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+                          NULL, caller_data);
+        return true;
+    }
+
+    return false;
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
@@ -4946,7 +5286,7 @@ make_tuple_from_result_row(PGresult *res,
         PgFdwScanState *fdw_sstate;
 
         Assert(fsstate);
-        fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+        fdw_sstate = GetPgFdwScanState(fsstate);
         tupdesc = fdw_sstate->tupdesc;
     }
 
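Before moving on to the header changes: the connection-sharing dance above
(current_owner plus a waiter list with a tail shortcut) is easier to see in
isolation. Below is a stand-alone sketch of just that protocol; ScanNode,
ConnPriv, enqueue_or_own and pass_ownership are illustrative stand-ins I made
up, not the actual PgFdwScanState/PgFdwConnpriv definitions.

#include <stddef.h>

typedef struct ScanNode ScanNode;

struct ScanNode
{
    ScanNode   *waiter;         /* next node waiting on this connection */
    ScanNode   *last_waiter;    /* tail shortcut; meaningful on the owner */
};

typedef struct ConnPriv
{
    ScanNode   *current_owner;  /* node whose query is in flight, or NULL */
} ConnPriv;

/* A node wants to send a query on a shared connection. */
static void
enqueue_or_own(ConnPriv *priv, ScanNode *node)
{
    if (priv->current_owner == NULL)
    {
        priv->current_owner = node;
        node->last_waiter = node;   /* points to itself while no one waits */
    }
    else
    {
        /* O(1) append, using the owner's shortcut to the tail */
        ScanNode   *owner = priv->current_owner;

        owner->last_waiter->waiter = node;
        owner->last_waiter = node;
    }
}

/* The owner has absorbed its result; hand the connection over. */
static void
pass_ownership(ConnPriv *priv)
{
    ScanNode   *owner = priv->current_owner;
    ScanNode   *next = owner->waiter;

    if (next != NULL)
    {
        next->last_waiter = owner->last_waiter; /* shortcut moves with ownership */
        owner->waiter = NULL;
        owner->last_waiter = owner;             /* reset to self */
    }
    priv->current_owner = next;                 /* NULL if no one was waiting */
}

The real patch does the hand-over inline in postgresIterateForeignScan()
rather than in a separate helper, but the pointer manipulation is the same.
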
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 788b003..41ac1d2 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
     UserMapping *user;            /* only set in use_remote_estimate mode */
 
     int            fetch_size;        /* fetch size for this remote table */
+    bool        allow_prefetch;    /* true to allow overlapped fetching */
 
     /*
      * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 3c3c5c7..cb9caa5 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1535,25 +1535,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1589,12 +1589,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1653,8 +1653,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.9.2
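
To summarize the fetch-path change in the patch above: the old synchronous
fetch_more_data() is split into request_more_data(), which only sends the
FETCH, and fetch_received_data(), which reads the rows later, possibly after
the socket has been polled. Stripped of cursor creation, tuple conversion and
error handling, the underlying libpq pattern is just this (a sketch with
made-up names request_rows/receive_rows; the real functions also drain
PQgetResult() and track the connection owner):

#include <stdio.h>
#include <libpq-fe.h>

/* Send the FETCH without waiting for its result. */
static void
request_rows(PGconn *conn, unsigned int cursor_number, int fetch_size)
{
    char        sql[64];

    snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
             fetch_size, cursor_number);
    PQsendQuery(conn, sql);        /* returns as soon as the query is sent */
}

/* Later, once PQisBusy() says data has arrived (or we decide to block). */
static PGresult *
receive_rows(PGconn *conn)
{
    return PQgetResult(conn);      /* blocks only if the result isn't here yet */
}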


Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
At Mon, 11 Dec 2017 20:07:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20171211.200753.191768178.horiguchi.kyotaro@lab.ntt.co.jp>
> > The attached PoC patch theoretically has no impact on the normal
> > code paths and just brings gain in async cases.
> 
> > The just-committed parallel append conflicted with this, and the
> > attached is the version rebased onto current HEAD. The result of a
> > concise performance test follows.
> 
>                            patched(ms)  unpatched(ms)   gain(%)
> A: simple table scan     :  3562.32      3444.81        -3.4
> B: local partitioning    :  1451.25      1604.38         9.5
> C: single remote table   :  8818.92      9297.76         5.1
> D: sharding (single con) :  5966.14      6646.73        10.2
> E: sharding (multi con)  :  1802.25      6515.49        72.3
> 
> > A and B are degradation checks, which are expected to show no
> > degradation.  C is the gain solely from postgres_fdw's command
> > pre-sending on a remote table.  D is the gain of sharding on a
> > connection. The number of partitions/shards is 4.  E is the gain
> > using a dedicated connection per shard.
> 
> Test A is accelerated by parallel sequential scan, and introducing
> parallel append accelerates test B. Comparing A and B, I doubt the
> degradation is stably measurable, at least in my environment, but I
> believe there is no degradation theoretically. Tests C to E still
> show a clear gain.
> regards,

The patch conflicts with 3cac0ec. This is the rebased version.
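
For reference, the gain column above is (unpatched - patched) / unpatched *
100; for example, E gives (6515.49 - 1802.25) / 6515.49 = 72.3%, and A gives
(3444.81 - 3562.32) / 3444.81 = -3.4%.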

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From be22b33b90abec93a2a609a1db4955e6910b2da0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

In certain cases a WaitEventSet needs to be released via a resource
owner. This change allows a WaitEventSet to be registered with a
resource owner, which the creator of the WaitEventSet can specify.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index a4f6d4d..890972b 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
                 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
     AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
                       NULL, NULL);
     AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index e6706f7..5457899 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
     int            nevents;        /* number of registered events */
     int            nevents_space;    /* maximum number of events in this set */
 
+    ResourceOwner    resowner;    /* Resource owner */
+
     /*
      * Array, of nevents_space length, storing the definition of events this
      * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
     int            ret = 0;
     int            rc;
     WaitEvent    event;
-    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
     if (wakeEvents & WL_TIMEOUT)
         Assert(timeout >= 0);
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
     WaitEventSet *set;
     char       *data;
     Size        sz = 0;
 
+    if (res)
+        ResourceOwnerEnlargeWESs(res);
+
     /*
      * Use MAXALIGN size/alignment to guarantee that later uses of memory are
      * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
     StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+    /* Register this wait event set if requested */
+    set->resowner = res;
+    if (res)
+        ResourceOwnerRememberWES(set->resowner, set);
+
     return set;
 }
 
@@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set)
     }
 #endif
 
+    if (set->resowner != NULL)
+        ResourceOwnerForgetWES(set->resowner, set);
+
     pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index ef1d5ba..30edc8e 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -69,7 +69,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     {
         WaitEventSet *new_event_set;
 
-        new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+        new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
         AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
                           MyLatch, NULL);
         AddWaitEventToSet(new_event_set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index e09a4f1..7ae8777 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
     ResourceArray snapshotarr;    /* snapshot references */
     ResourceArray filearr;        /* open temporary files */
     ResourceArray dsmarr;        /* dynamic shmem segments */
+    ResourceArray wesarr;        /* wait event sets */
 
     /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
     int            nlocks;            /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
     ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
     ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
     ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+    ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
     return owner;
 }
@@ -538,6 +541,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
                 PrintDSMLeakWarning(res);
             dsm_detach(res);
         }
+
+        /* Ditto for wait event sets */
+        while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+        {
+            WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+            if (isCommit)
+                PrintWESLeakWarning(event);
+            FreeWaitEventSet(event);
+        }
     }
     else if (phase == RESOURCE_RELEASE_LOCKS)
     {
@@ -685,6 +698,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     Assert(owner->snapshotarr.nitems == 0);
     Assert(owner->filearr.nitems == 0);
     Assert(owner->dsmarr.nitems == 0);
+    Assert(owner->wesarr.nitems == 0);
     Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
     /*
@@ -711,6 +725,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     ResourceArrayFree(&(owner->snapshotarr));
     ResourceArrayFree(&(owner->filearr));
     ResourceArrayFree(&(owner->dsmarr));
+    ResourceArrayFree(&(owner->wesarr));
 
     pfree(owner);
 }
@@ -1253,3 +1268,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
     elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
          dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+    ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+    ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identifier of a wait event
+     * set, so use its pointer instead.
+     */
+    if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+        elog(ERROR, "wait event set %p is not owned by resource owner %s",
+             events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identifier of a wait event
+     * set, so use its pointer instead.
+     */
+    elog(WARNING, "wait event set leak: %p still referenced",
+         events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index a4bcb48..838845a 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+                                        ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
                   Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 22b377c..56f2059 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
                        dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+                         WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+                       WaitEventSet *);
+
 #endif                            /* RESOWNER_PRIVATE_H */
-- 
2.9.2
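
With this patch in place, a caller can tie a WaitEventSet's lifetime to a
resource owner, so an error raised between creation and FreeWaitEventSet() no
longer leaks the set; patch 2's execAsync.c does exactly this with
TopTransactionResourceOwner. A condensed usage sketch follows
(wait_for_socket is a hypothetical caller; the three-argument
CreateWaitEventSet() is the patched signature, and the enlarge-before-acquire
step is hidden inside it):

#include "postgres.h"

#include "storage/latch.h"
#include "utils/memutils.h"
#include "utils/resowner.h"

static void
wait_for_socket(pgsocket sock)
{
    WaitEvent    ev;
    /* Registered with the transaction's owner: reclaimed on abort, too. */
    WaitEventSet *wes = CreateWaitEventSet(TopTransactionContext,
                                           TopTransactionResourceOwner, 1);

    AddWaitEventToSet(wes, WL_SOCKET_READABLE, sock, NULL, NULL);
    (void) WaitEventSetWait(wes, -1 /* no timeout */, &ev, 1, 0);

    /*
     * On the success path we free the set ourselves; FreeWaitEventSet()
     * calls ResourceOwnerForgetWES(), so the owner won't warn at commit.
     */
    FreeWaitEventSet(wes);
}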

From 885f62d89a93edbda44330c3ecc3a7ac08e302ea Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:23:51 +0900
Subject: [PATCH 2/3] core side modification

---
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/execAsync.c        | 110 ++++++++++++++
 src/backend/executor/nodeAppend.c       | 247 +++++++++++++++++++++++++++-----
 src/backend/executor/nodeForeignscan.c  |  22 ++-
 src/backend/optimizer/plan/createplan.c |  62 +++++++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/include/executor/execAsync.h        |  23 +++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 ++
 src/include/nodes/execnodes.h           |  18 ++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 13 files changed, 462 insertions(+), 45 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895..8ad2adf 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execPartition.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..f7daed7
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,110 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *      Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *      src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+void
+ExecAsyncSetState(PlanState *pstate, AsyncState status)
+{
+    pstate->asyncstate = status;
+}
+
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                       void *data, bool reinit)
+{
+    switch (nodeTag(node))
+    {
+    case T_ForeignScanState:
+        return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+                                             wes, data, reinit);
+        break;
+    default:
+        elog(ERROR, "unrecognized node type: %d",
+             (int) nodeTag(node));
+    }
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+    static int *refind = NULL;
+    static int refindsize = 0;
+    WaitEventSet *wes;
+    WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+    int noccurred = 0;
+    Bitmapset *fired_events = NULL;
+    int i;
+    int n;
+
+    n = bms_num_members(waitnodes);
+    wes = CreateWaitEventSet(TopTransactionContext,
+                             TopTransactionResourceOwner, n);
+    if (refindsize < n)
+    {
+        if (refindsize == 0)
+            refindsize = EVENT_BUFFER_SIZE; /* XXX */
+        while (refindsize < n)
+            refindsize *= 2;
+        if (refind)
+            refind = (int *) repalloc(refind, refindsize * sizeof(int));
+        else
+            refind = (int *) palloc(refindsize * sizeof(int));
+    }
+
+    n = 0;
+    for (i = bms_next_member(waitnodes, -1); i >= 0;
+         i = bms_next_member(waitnodes, i))
+    {
+        refind[i] = i;
+        if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+            n++;
+    }
+
+    if (n == 0)
+    {
+        FreeWaitEventSet(wes);
+        return NULL;
+    }
+
+    noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+                                 EVENT_BUFFER_SIZE,
+                                 WAIT_EVENT_ASYNC_WAIT);
+    FreeWaitEventSet(wes);
+    if (noccurred == 0)
+        return NULL;
+
+    for (i = 0; i < noccurred; i++)
+    {
+        WaitEvent *w = &occurred_event[i];
+
+        if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+        {
+            int n = *(int *) w->user_data;
+
+            fired_events = bms_add_member(fired_events, n);
+        }
+    }
+
+    return fired_events;
+}
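
One subtlety in ExecAsyncEventWait() above: the static refind array exists
only to give each waiting node's index a stable address that can ride along
as the event's user_data and be recovered after WaitEventSetWait() returns.
The round trip in isolation (stand-alone C; this WaitEvent is a stand-in for
the real latch.h struct):

#include <stdio.h>

/* Stand-in for latch.h's WaitEvent; only user_data matters here. */
typedef struct WaitEvent
{
    void       *user_data;
} WaitEvent;

int
main(void)
{
    int         refind[16];
    WaitEvent   ev;

    /* Registration: stash node 5's index behind a stable pointer. */
    refind[5] = 5;
    ev.user_data = &refind[5];

    /* Wakeup: recover which node's socket fired. */
    printf("node %d is ready\n", *(int *) ev.user_data);
    return 0;
}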
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 64a17fb..644af5b 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -59,6 +59,7 @@
 
 #include "executor/execdebug.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -79,6 +80,7 @@ struct ParallelAppendState
 #define INVALID_SUBPLAN_INDEX        -1
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -104,7 +106,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     ListCell   *lc;
 
     /* check for unsupported flags */
-    Assert(!(eflags & EXEC_FLAG_MARK));
+    Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
     /*
      * Lock the non-leaf tables in the partition tree controlled by this node.
@@ -127,6 +129,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     appendstate->ps.ExecProcNode = ExecAppend;
     appendstate->appendplans = appendplanstates;
     appendstate->as_nplans = nplans;
+    appendstate->as_nasyncplans = node->nasyncplans;
+    appendstate->as_syncdone = (node->nasyncplans == nplans);
+    appendstate->as_asyncresult = (TupleTableSlot **)
+        palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+    /* Choose async version of Exec function */
+    if (appendstate->as_nasyncplans > 0)
+        appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+    /* initially, all async subplans need a request */
+    for (i = 0; i < appendstate->as_nasyncplans; ++i)
+        appendstate->as_needrequest =
+            bms_add_member(appendstate->as_needrequest, i);
 
     /*
      * Miscellaneous initialization
@@ -149,27 +164,48 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     foreach(lc, node->appendplans)
     {
         Plan       *initNode = (Plan *) lfirst(lc);
+        int            sub_eflags = eflags;
+
+        if (i < appendstate->as_nasyncplans)
+            sub_eflags |= EXEC_FLAG_ASYNC;
 
-        appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
+        appendplanstates[i] = ExecInitNode(initNode, estate, sub_eflags);
         i++;
     }
 
     /*
      * initialize output tuple type
      */
     ExecAssignResultTypeFromTL(&appendstate->ps);
     appendstate->ps.ps_ProjInfo = NULL;
 
-    /*
-     * Parallel-aware append plans must choose the first subplan to execute by
-     * looking at shared memory, but non-parallel-aware append plans can
-     * always start with the first subplan.
-     */
-    appendstate->as_whichplan =
-        appendstate->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0;
+    if (appendstate->ps.plan->parallel_aware)
+    {
+        /*
+         * Parallel-aware append plans must choose the first subplan to
+         * execute by looking at shared memory, but non-parallel-aware append
+         * plans can always start with the first subplan.
+         */
 
-    /* If parallel-aware, this will be overridden later. */
-    appendstate->choose_next_subplan = choose_next_subplan_locally;
+        appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
+
+        /* This will be overridden later. */
+        appendstate->choose_next_subplan = choose_next_subplan_locally;
+    }
+    else
+    {
+        /*
+         * Initialize to scan the first synchronous subplan.
+         */
+        appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
+        appendstate->choose_next_subplan = choose_next_subplan_locally;
+    }
 
     return appendstate;
 }
@@ -186,10 +222,12 @@ ExecAppend(PlanState *pstate)
     AppendState *node = castNode(AppendState, pstate);
 
     /* If no subplan has been chosen, we must choose one before proceeding. */
-    if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+    if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
         !node->choose_next_subplan(node))
         return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 
+    Assert(node->as_nasyncplans == 0);
+
     for (;;)
     {
         PlanState  *subnode;
@@ -200,8 +238,9 @@ ExecAppend(PlanState *pstate)
         /*
          * figure out which subplan we are currently processing
          */
-        Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-        subnode = node->appendplans[node->as_whichplan];
+        Assert(node->as_whichsyncplan >= 0 &&
+               node->as_whichsyncplan < node->as_nplans);
+        subnode = node->appendplans[node->as_whichsyncplan];
 
         /*
          * get a tuple from the subplan
@@ -224,6 +263,137 @@ ExecAppend(PlanState *pstate)
     }
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+    AppendState *node = castNode(AppendState, pstate);
+    Bitmapset *needrequest;
+    int    i;
+
+    Assert(node->as_nasyncplans > 0);
+
+    if (node->as_nasyncresult > 0)
+    {
+        --node->as_nasyncresult;
+        return node->as_asyncresult[node->as_nasyncresult];
+    }
+
+    needrequest = node->as_needrequest;
+    node->as_needrequest = NULL;
+    while ((i = bms_first_member(needrequest)) >= 0)
+    {
+        TupleTableSlot *slot;
+        PlanState *subnode = node->appendplans[i];
+
+        slot = ExecProcNode(subnode);
+        if (subnode->asyncstate == AS_AVAILABLE)
+        {
+            if (!TupIsNull(slot))
+            {
+                node->as_asyncresult[node->as_nasyncresult++] = slot;
+                node->as_needrequest = bms_add_member(node->as_needrequest, i);
+            }
+        }
+        else
+            node->as_pending_async = bms_add_member(node->as_pending_async, i);
+    }
+    bms_free(needrequest);
+
+    for (;;)
+    {
+        TupleTableSlot *result;
+
+        /* return now if a result is available */
+        if (node->as_nasyncresult > 0)
+        {
+            --node->as_nasyncresult;
+            return node->as_asyncresult[node->as_nasyncresult];
+        }
+
+        while (!bms_is_empty(node->as_pending_async))
+        {
+            long timeout = node->as_syncdone ? -1 : 0;
+            Bitmapset *fired;
+            int i;
+
+            fired = ExecAsyncEventWait(node->appendplans, node->as_pending_async,
+                                       timeout);
+            while ((i = bms_first_member(fired)) >= 0)
+            {
+                TupleTableSlot *slot;
+                PlanState *subnode = node->appendplans[i];
+                slot = ExecProcNode(subnode);
+                if (subnode->asyncstate == AS_AVAILABLE)
+                {
+                    if (!TupIsNull(slot))
+                    {
+                        node->as_asyncresult[node->as_nasyncresult++] = slot;
+                        node->as_needrequest =
+                            bms_add_member(node->as_needrequest, i);
+                    }
+                    node->as_pending_async =
+                        bms_del_member(node->as_pending_async, i);
+                }
+            }
+            bms_free(fired);
+
+            /* return now if a result is available */
+            if (node->as_nasyncresult > 0)
+            {
+                --node->as_nasyncresult;
+                return node->as_asyncresult[node->as_nasyncresult];
+            }
+
+            if (!node->as_syncdone)
+                break;
+        }
+
+        /*
+         * If there is no asynchronous activity still pending and the
+         * synchronous activity is also complete, we're totally done scanning
+         * this node.  Otherwise, we're done with the asynchronous stuff but
+         * must continue scanning the synchronous children.
+         */
+        if (node->as_syncdone)
+        {
+            Assert(bms_is_empty(node->as_pending_async));
+            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+        }
+
+        /*
+         * get a tuple from the subplan
+         */
+        result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+        if (!TupIsNull(result))
+        {
+            /*
+             * If the subplan gave us something then return it as-is. We do
+             * NOT make use of the result slot that was set up in
+             * ExecInitAppend; there's no need for it.
+             */
+            return result;
+        }
+
+        /*
+         * Go on to the "next" subplan in the appropriate direction. If no
+         * more subplans, return the empty slot set up for us by
+         * ExecInitAppend, unless there are async plans we have yet to finish.
+         */
+        if (!node->choose_next_subplan(node))
+        {
+            node->as_syncdone = true;
+            if (bms_is_empty(node->as_pending_async))
+            {
+                Assert(bms_is_empty(node->as_needrequest));
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+            }
+        }
+
+        /* Else loop back and try to get a tuple from the new subplan */
+    }
+}
+
 /* ----------------------------------------------------------------
  *        ExecEndAppend
  *
@@ -257,6 +427,15 @@ ExecReScanAppend(AppendState *node)
 {
     int            i;
 
+    /* Reset async state. */
+    for (i = 0; i < node->as_nasyncplans; ++i)
+    {
+        ExecShutdownNode(node->appendplans[i]);
+        node->as_needrequest = bms_add_member(node->as_needrequest, i);
+    }
+    node->as_nasyncresult = 0;
+    node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
     for (i = 0; i < node->as_nplans; i++)
     {
         PlanState  *subnode = node->appendplans[i];
@@ -276,7 +455,7 @@ ExecReScanAppend(AppendState *node)
             ExecReScan(subnode);
     }
 
-    node->as_whichplan =
+    node->as_whichsyncplan =
         node->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0;
 }
 
@@ -365,7 +544,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-    int            whichplan = node->as_whichplan;
+    int            whichplan = node->as_whichsyncplan;
 
     /* We should never see INVALID_SUBPLAN_INDEX in this case. */
     Assert(whichplan >= 0 && whichplan <= node->as_nplans);
@@ -374,13 +553,13 @@ choose_next_subplan_locally(AppendState *node)
     {
         if (whichplan >= node->as_nplans - 1)
             return false;
-        node->as_whichplan++;
+        node->as_whichsyncplan++;
     }
     else
     {
         if (whichplan <= 0)
             return false;
-        node->as_whichplan--;
+        node->as_whichsyncplan--;
     }
 
     return true;
@@ -405,33 +584,33 @@ choose_next_subplan_for_leader(AppendState *node)
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
     {
         /* Mark just-completed subplan as finished. */
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
     }
     else
     {
         /* Start with last subplan. */
-        node->as_whichplan = node->as_nplans - 1;
+        node->as_whichsyncplan = node->as_nplans - 1;
     }
 
     /* Loop until we find a subplan to execute. */
-    while (pstate->pa_finished[node->as_whichplan])
+    while (pstate->pa_finished[node->as_whichsyncplan])
     {
-        if (node->as_whichplan == 0)
+        if (node->as_whichsyncplan == 0)
         {
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-            node->as_whichplan = INVALID_SUBPLAN_INDEX;
+            node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
             LWLockRelease(&pstate->pa_lock);
             return false;
         }
-        node->as_whichplan--;
+        node->as_whichsyncplan--;
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < append->first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < append->first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -463,8 +642,8 @@ choose_next_subplan_for_worker(AppendState *node)
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
     /* Mark just-completed subplan as finished. */
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     /* If all the plans are already done, we have nothing to do */
     if (pstate->pa_next_plan == INVALID_SUBPLAN_INDEX)
@@ -489,10 +668,10 @@ choose_next_subplan_for_worker(AppendState *node)
         else
         {
             /* At last plan, no partial plans, arrange to bail out. */
-            pstate->pa_next_plan = node->as_whichplan;
+            pstate->pa_next_plan = node->as_whichsyncplan;
         }
 
-        if (pstate->pa_next_plan == node->as_whichplan)
+        if (pstate->pa_next_plan == node->as_whichsyncplan)
         {
             /* We've tried everything! */
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -502,7 +681,7 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Pick the plan we found, and advance pa_next_plan one more time. */
-    node->as_whichplan = pstate->pa_next_plan++;
+    node->as_whichsyncplan = pstate->pa_next_plan++;
     if (pstate->pa_next_plan >= node->as_nplans)
     {
         if (append->first_partial_plan < node->as_nplans)
@@ -518,8 +697,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < append->first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < append->first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 59865f5..9cb5470 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate)
                     (ExecScanRecheckMtd) ForeignRecheck);
 }
 
-
 /* ----------------------------------------------------------------
  *        ExecInitForeignScan
  * ----------------------------------------------------------------
@@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
     scanstate->ss.ps.plan = (Plan *) node;
     scanstate->ss.ps.state = estate;
     scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+    scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+    if ((eflags & EXEC_FLAG_ASYNC) != 0)
+        scanstate->fs_async = true;
 
     /*
      * Miscellaneous initialization
@@ -389,3 +392,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
     if (fdwroutine->ShutdownForeignScan)
         fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *        ExecForeignAsyncConfigureWait
+ *
+ *        In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                              void *caller_data, bool reinit)
+{
+    FdwRoutine *fdwroutine = node->fdwroutine;
+
+    Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+    return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+                                                 caller_data, reinit);
+}
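
For illustration, an FDW-side implementation of this callback might look like
the sketch below.  MyFdwScanState, its query_in_flight flag, and the myfdw*
name are hypothetical; AddWaitEventToSet, PQsocket, and WL_SOCKET_READABLE are
the real primitives involved:

    static bool
    myfdwForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
                                   void *caller_data, bool reinit)
    {
        MyFdwScanState *st = (MyFdwScanState *) node->fdw_state;

        /* Nothing to wait for unless a query is in flight on this connection */
        if (!st->query_in_flight)
            return false;

        /* Wake the waiting requestor when the remote socket becomes readable */
        AddWaitEventToSet(wes, WL_SOCKET_READABLE, PQsocket(st->conn),
                          NULL, caller_data);
        return true;
    }
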
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e599283..d85cb9c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -204,7 +204,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
                    Index scanrelid, int wtParam);
 static Append *make_append(List *appendplans, int first_partial_plan,
-            List *tlist, List *partitioned_rels);
+                           int nasyncplans,    int referent,
+                           List *tlist, List *partitioned_rels);
 static RecursiveUnion *make_recursive_union(List *tlist,
                      Plan *lefttree,
                      Plan *righttree,
@@ -284,6 +285,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
                  List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
                          GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -1014,8 +1016,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
     Append       *plan;
     List       *tlist = build_path_tlist(root, &best_path->path);
-    List       *subplans = NIL;
+    List       *asyncplans = NIL;
+    List       *syncplans = NIL;
     ListCell   *subpaths;
+    int            nasyncplans = 0;
+    bool        first = true;
+    bool        referent_is_sync = true;
 
     /*
      * The subpaths list could be empty, if every child was proven empty by
@@ -1050,7 +1056,21 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
         /* Must insist that all children return the same tlist */
         subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-        subplans = lappend(subplans, subplan);
+        /*
+         * Classify as async-capable or not.  If we have decided to run the
+         * children in parallel, we cannot let any of them run asynchronously.
+         */
+        if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+        {
+            asyncplans = lappend(asyncplans, subplan);
+            ++nasyncplans;
+            if (first)
+                referent_is_sync = false;
+        }
+        else
+            syncplans = lappend(syncplans, subplan);
+
+        first = false;
     }
 
     /*
@@ -1060,8 +1080,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
      * parent-rel Vars it'll be asked to emit.
      */
 
-    plan = make_append(subplans, best_path->first_partial_path,
-                       tlist, best_path->partitioned_rels);
+    plan = make_append(list_concat(asyncplans, syncplans),
+                       best_path->first_partial_path, nasyncplans,
+                       referent_is_sync ? nasyncplans : 0, tlist,
+                       best_path->partitioned_rels);
 
     copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5307,8 +5329,8 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int first_partial_plan,
-            List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int first_partial_plan, int nasyncplans,
+            int referent, List *tlist, List *partitioned_rels)
 {
     Append       *node = makeNode(Append);
     Plan       *plan = &node->plan;
@@ -5320,6 +5342,8 @@ make_append(List *appendplans, int first_partial_plan,
     node->partitioned_rels = partitioned_rels;
     node->appendplans = appendplans;
     node->first_partial_plan = first_partial_plan;
+    node->nasyncplans = nasyncplans;
+    node->referent = referent;
 
     return node;
 }
@@ -6656,3 +6680,27 @@ is_projection_capable_plan(Plan *plan)
     }
     return true;
 }
+
+/*
+ * is_async_capable_path
+ *        Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+    switch (nodeTag(path))
+    {
+        case T_ForeignPath:
+            {
+                FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+                Assert(fdwroutine != NULL);
+                if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+                    fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+                    return true;
+            }
+            break;
+        default:
+            break;
+    }
+    return false;
+}
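
Note that create_append_plan above now moves all async-capable children to the
front of the appendplans list, so executor code can classify a child by index
alone.  A minimal sketch of that invariant, assuming the Append node layout
introduced in plannodes.h further below:

    /* Async children occupy slots [0, nasyncplans) of appendplans */
    static inline bool
    append_child_is_async(Append *node, int i)
    {
        return i < node->nasyncplans;
    }
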
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d130114..667878b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3673,6 +3673,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_SYNC_REP:
             event_name = "SyncRep";
             break;
+        case WAIT_EVENT_ASYNC_WAIT:
+            event_name = "AsyncExecWait";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..5fd67d9
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *        Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *        src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern void ExecAsyncSetState(PlanState *pstate, AsyncState status);
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                                   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+                                     long timeout);
+#endif   /* EXECASYNC_H */
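
The requestor-side pattern these declarations imply is roughly the loop the
Append changes above follow: keep the pending async children in a Bitmapset,
sleep in ExecAsyncEventWait() until some of them report readiness, and clear
the fired bits.  A condensed, illustrative sketch (drain_pending and its
arguments are hypothetical; a negative timeout is assumed to mean "no
timeout", as with WaitEventSetWait):

    #include "executor/execAsync.h"

    static void
    drain_pending(PlanState **nodes, Bitmapset *pending)
    {
        while (!bms_is_empty(pending))
        {
            /* Blocks until at least one pending node becomes ready */
            Bitmapset  *fired = ExecAsyncEventWait(nodes, pending, -1);
            int         i = -1;

            while ((i = bms_next_member(fired, i)) >= 0)
                pending = bms_del_member(pending, i);
            bms_free(fired);
        }
    }
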
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 6545a80..60f4e51 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -63,6 +63,7 @@
 #define EXEC_FLAG_WITH_OIDS        0x0020    /* force OIDs in returned tuples */
 #define EXEC_FLAG_WITHOUT_OIDS    0x0040    /* force no OIDs in returned tuples */
 #define EXEC_FLAG_WITH_NO_DATA    0x0080    /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC            0x0100    /* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index ccb66be..67abf8e 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
                                 ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+                                          WaitEventSet *wes,
+                                          void *caller_data, bool reinit);
 
 #endif                            /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e88fee3..beb3f0d 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -161,6 +161,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
                                                             List *fdw_private,
                                                             RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+                                                    WaitEventSet *wes,
+                                                    void *caller_data,
+                                                    bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -182,6 +187,7 @@ typedef struct FdwRoutine
     GetForeignPlan_function GetForeignPlan;
     BeginForeignScan_function BeginForeignScan;
     IterateForeignScan_function IterateForeignScan;
+    IterateForeignScan_function IterateForeignScanAsync;
     ReScanForeignScan_function ReScanForeignScan;
     EndForeignScan_function EndForeignScan;
 
@@ -232,6 +238,11 @@ typedef struct FdwRoutine
     InitializeDSMForeignScan_function InitializeDSMForeignScan;
     ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
     InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+    /* Support functions for asynchronous execution */
+    IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+    ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
     ShutdownForeignScan_function ShutdownForeignScan;
 
     /* Support functions for path reparameterization. */
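
An FDW opts in by filling the two new fields from its handler function, just
as postgres_fdw does in the third patch below.  Schematically, with the
myfdw* callbacks as placeholders:

    Datum
    myfdw_handler(PG_FUNCTION_ARGS)
    {
        FdwRoutine *routine = makeNode(FdwRoutine);

        routine->GetForeignRelSize = myfdwGetForeignRelSize;
        routine->GetForeignPaths = myfdwGetForeignPaths;
        routine->GetForeignPlan = myfdwGetForeignPlan;
        routine->BeginForeignScan = myfdwBeginForeignScan;
        routine->IterateForeignScan = myfdwIterateForeignScan;
        routine->ReScanForeignScan = myfdwReScanForeignScan;
        routine->EndForeignScan = myfdwEndForeignScan;

        /* New async entry points; both may be left NULL by sync-only FDWs */
        routine->IsForeignPathAsyncCapable = myfdwIsForeignPathAsyncCapable;
        routine->ForeignAsyncConfigureWait = myfdwForeignAsyncConfigureWait;

        PG_RETURN_POINTER(routine);
    }
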
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4bb5cb1..405ad7b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -851,6 +851,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+    AS_AVAILABLE,
+    AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
     NodeTag        type;
@@ -891,6 +897,9 @@ typedef struct PlanState
     TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
     ExprContext *ps_ExprContext;    /* node's expression-evaluation context */
     ProjectionInfo *ps_ProjInfo;    /* info for doing tuple projection */
+
+    AsyncState    asyncstate;
+    int32        padding;            /* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1013,10 +1022,16 @@ struct AppendState
     PlanState    ps;                /* its first field is NodeTag */
     PlanState **appendplans;    /* array of PlanStates for my inputs */
     int            as_nplans;
-    int            as_whichplan;
+    int            as_nasyncplans;    /* # of async-capable children */
     ParallelAppendState *as_pstate; /* parallel coordination info */
+    int            as_whichsyncplan; /* which sync plan is being executed  */
     Size        pstate_len;        /* size of parallel coordination info */
     bool        (*choose_next_subplan) (AppendState *);
+    bool        as_syncdone;    /* all synchronous plans done? */
+    Bitmapset  *as_needrequest;    /* async plans needing a new request */
+    Bitmapset  *as_pending_async;    /* pending async plans */
+    TupleTableSlot **as_asyncresult;    /* unreturned results of async plans */
+    int            as_nasyncresult;    /* # of valid entries in as_asyncresult */
 };
 
 /* ----------------
@@ -1567,6 +1582,7 @@ typedef struct ForeignScanState
     Size        pscan_len;        /* size of parallel coordination information */
     /* use struct pointer to avoid including fdwapi.h here */
     struct FdwRoutine *fdwroutine;
+    bool        fs_async;
     void       *fdw_state;        /* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 74e9fb5..b4535f0 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -249,6 +249,8 @@ typedef struct Append
     List       *partitioned_rels;
     List       *appendplans;
     int            first_partial_plan;
+    int            nasyncplans;    /* # of async plans, always at start of list */
+    int            referent;         /* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3d3c0b6..a1ba26f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -831,7 +831,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_ORIGIN_DROP,
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
+    WAIT_EVENT_SYNC_REP,
+    WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2

From 6612fbe0cab492fedead1d35f1b9cdf24f3e6dd4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c              |  26 ++
 contrib/postgres_fdw/expected/postgres_fdw.out | 128 ++++---
 contrib/postgres_fdw/postgres_fdw.c            | 484 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 522 insertions(+), 138 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 00c926b..4f3d59d 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
     bool        invalidated;    /* true if reconnect is pending */
     uint32        server_hashvalue;    /* hash value of foreign server OID */
     uint32        mapping_hashvalue;    /* hash value of user mapping OID */
+    void        *storage;        /* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
         elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
              entry->conn, server->servername, user->umid, user->userid);
+        entry->storage = NULL;
     }
 
     /*
@@ -216,6 +218,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user.  Allocate it with
+ * initsize bytes if it doesn't exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+    bool        found;
+    ConnCacheEntry *entry;
+    ConnCacheKey key;
+
+    key = user->umid;
+    entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+    Assert(found);
+
+    if (entry->storage == NULL)
+    {
+        entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+        memset(entry->storage, 0, initsize);
+    }
+
+    return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
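
GetConnectionSpecificStorage gives every scan that shares a connection one
common, zero-initialized scratch area that lives as long as the connection
cache entry.  postgres_fdw uses it below to track which scan node currently
owns the connection.  A condensed sketch of a caller (the myfdw_attach name
is hypothetical; PgFdwConnpriv is defined in postgres_fdw.c below):

    static void
    myfdw_attach_connpriv(PgFdwScanState *fsstate, UserMapping *user)
    {
        fsstate->s.connpriv = (PgFdwConnpriv *)
            GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
        /* Zeroed on first allocation, so current_owner starts out NULL */
    }
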
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 683d641..3b4eefa 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6514,7 +6514,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -6542,7 +6542,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6570,7 +6570,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6598,7 +6598,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -6664,35 +6664,40 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6702,35 +6707,40 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -6760,11 +6770,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -6778,11 +6788,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (39 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6813,16 +6823,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -6840,16 +6850,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -7000,27 +7010,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 7992ba5..5ea1d88 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+    ForeignScanState *current_owner;    /* The node currently running a query
+                                         * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+    PGconn       *conn;            /* connection for the scan */
+    PgFdwConnpriv *connpriv;    /* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table. NULL
                                  * for a foreign join scan. */
     TupleDesc    tupdesc;        /* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
     List       *retrieved_attrs;    /* list of retrieved attribute numbers */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
+    bool        result_ready;
     unsigned int cursor_number; /* quasi-unique ID for my cursor */
     bool        cursor_exists;    /* have we created the cursor? */
     int            numParams;        /* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
     /* batch-level state, for optimizing rewinds and avoiding useless fetch */
     int            fetch_ct_2;        /* Min(# of fetches done, 2) */
     bool        eof_reached;    /* true if last fetch reached EOF */
+    bool        run_async;        /* true if run asynchronously */
+    bool        async_waiting;    /* true if requesting the parent to wait */
+    ForeignScanState *waiter;    /* Next node to run a query among nodes
+                                 * sharing the same connection */
+    ForeignScanState *last_waiter;    /* A waiting node at the end of the
+                                       * waiting list.  Maintained only by
+                                       * the current owner of the connection */
 
     /* working memory contexts */
     MemoryContext batch_cxt;    /* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     char       *p_name;            /* name of prepared statement, if created */
 
     /* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
@@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
                                 RangeTblEntry *target_rte,
                                 Relation target_relation);
@@ -348,6 +380,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
                              UpperRelationKind stage,
                              RelOptInfo *input_rel,
                              RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+                                              WaitEventSet *wes,
+                                              void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -368,7 +404,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
                           EquivalenceClass *ec, EquivalenceMember *em,
                           void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -438,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     routine->IterateForeignScan = postgresIterateForeignScan;
     routine->ReScanForeignScan = postgresReScanForeignScan;
     routine->EndForeignScan = postgresEndForeignScan;
+    routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
     /* Functions for updating foreign tables */
     routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -472,6 +512,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     /* Support functions for upper relation push-down */
     routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+    /* Support functions for async execution */
+    routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+    routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
     PG_RETURN_POINTER(routine);
 }
 
@@ -1322,12 +1366,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    fsstate->conn = GetConnection(user, false);
+    fsstate->s.conn = GetConnection(user, false);
+    fsstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+    fsstate->s.connpriv->current_owner = NULL;
+    fsstate->waiter = NULL;
+    fsstate->last_waiter = node;
 
     /* Assign a unique ID for my cursor */
-    fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+    fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
     fsstate->cursor_exists = false;
 
+    /* Initialize async execution status */
+    fsstate->run_async = false;
+    fsstate->async_waiting = false;
+
     /* Get private info created by planner functions. */
     fsstate->query = strVal(list_nth(fsplan->fdw_private,
                                      FdwScanPrivateSelectSql));
@@ -1383,32 +1436,136 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
     /*
-     * If this is the first call after Begin or ReScan, we need to create the
-     * cursor on the remote side.
-     */
-    if (!fsstate->cursor_exists)
-        create_cursor(node);
-
-    /*
      * Get some more tuples, if we've run out.
      */
     if (fsstate->next_tuple >= fsstate->num_tuples)
     {
-        /* No point in another fetch if we already detected EOF, though. */
-        if (!fsstate->eof_reached)
-            fetch_more_data(node);
-        /* If we didn't get any tuples, must be end of data. */
+        ForeignScanState *next_conn_owner = node;
+
+        /* This node has sent a query on this connection */
+        if (fsstate->s.connpriv->current_owner == node)
+        {
+            /* Check if the result is available */
+            if (PQisBusy(fsstate->s.conn))
+            {
+                int rc = WaitLatchOrSocket(NULL,
+                                           WL_SOCKET_READABLE | WL_TIMEOUT,
+                                           PQsocket(fsstate->s.conn), 0,
+                                           WAIT_EVENT_ASYNC_WAIT);
+                if (node->fs_async && !(rc & WL_SOCKET_READABLE))
+                {
+                    /*
+                     * This node is not ready yet. Tell the caller to wait.
+                     */
+                    fsstate->result_ready = false;
+                    node->ss.ps.asyncstate = AS_WAITING;
+                    return ExecClearTuple(slot);
+                }
+            }
+
+            Assert(fsstate->async_waiting);
+            fsstate->async_waiting = false;
+            fetch_received_data(node);
+
+            /*
+             * If someone is waiting for this node on the same connection, let
+             * the first waiter become the next owner of the connection.
+             */
+            if (fsstate->waiter)
+            {
+                PgFdwScanState *next_owner_state;
+
+                next_conn_owner = fsstate->waiter;
+                next_owner_state = GetPgFdwScanState(next_conn_owner);
+                fsstate->waiter = NULL;
+
+                /*
+                 * Only the current owner is responsible for maintaining the
+                 * shortcut to the last waiter.
+                 */
+                next_owner_state->last_waiter = fsstate->last_waiter;
+
+                /*
+                 * For simplicity, last_waiter points to the node itself when
+                 * no one is waiting for it.
+                 */
+                fsstate->last_waiter = node;
+            }
+        }
+        else if (fsstate->s.connpriv->current_owner &&
+                 !GetPgFdwScanState(node)->eof_reached)
+        {
+            /*
+             * Someone else is holding this connection and we want this node
+             * to run later.  Add this node to the tail of the waiters' list,
+             * then return not-ready.  To avoid scanning through the waiters'
+             * list, the current owner maintains the shortcut to the last
+             * waiter.
+             */
+            PgFdwScanState *conn_owner_state =
+                GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+            ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+            PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+            last_waiter_state->waiter = node;
+            conn_owner_state->last_waiter = node;
+
+            /* Register the node in the async-waiting node list */
+            Assert(!GetPgFdwScanState(node)->async_waiting);
+
+            GetPgFdwScanState(node)->async_waiting = true;
+
+            fsstate->result_ready = fsstate->eof_reached;
+            node->ss.ps.asyncstate =
+                fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
+            return ExecClearTuple(slot);
+        }
+
+        /* At this time no node is running on the connection */
+        Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+               == NULL);
+        /*
+         * Send the next fetch request on behalf of the next owner of this
+         * connection, if needed.
+         */
+        if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+        {
+            PgFdwScanState *next_owner_state =
+                GetPgFdwScanState(next_conn_owner);
+
+            request_more_data(next_conn_owner);
+
+            /* Register the node in the async-waiting node list */
+            next_owner_state->async_waiting = true;
+
+            if (!next_conn_owner->fs_async)
+                fetch_received_data(next_conn_owner);
+        }
+
+
+         * If we haven't received a result for the given node this time,
+         * return with no tuple to give way to other nodes.
+         */
         if (fsstate->next_tuple >= fsstate->num_tuples)
+        {
+            fsstate->result_ready = fsstate->eof_reached;
+            node->ss.ps.asyncstate =
+                fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
             return ExecClearTuple(slot);
+        }
     }
 
     /*
      * Return the next tuple.
      */
+    fsstate->result_ready = true;
+    node->ss.ps.asyncstate = AS_AVAILABLE;
     ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
                    slot,
                    InvalidBuffer,
@@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     char        sql[64];
     PGresult   *res;
 
@@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node)
     if (!fsstate->cursor_exists)
         return;
 
+    /* Absorb the remaining result */
+    absorb_current_result(node);
+
     /*
      * If any internal parameters affecting this node have changed, we'd
      * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_exec_query(fsstate->conn, sql);
+    res = pgfdw_exec_query(fsstate->s.conn, sql);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+        pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
     PQclear(res);
 
     /* Now force a fresh FETCH. */
@@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
     /* if fsstate is NULL, we are in EXPLAIN; nothing to do */
     if (fsstate == NULL)
@@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
     /* Close the cursor if open, to prevent accumulation of cursors */
     if (fsstate->cursor_exists)
-        close_cursor(fsstate->conn, fsstate->cursor_number);
+        close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
     /* Release remote connection */
-    ReleaseConnection(fsstate->conn);
-    fsstate->conn = NULL;
+    ReleaseConnection(fsstate->s.conn);
+    fsstate->s.conn = NULL;
 
     /* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *        Remove asynchrony state and clean up leftover results on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+    ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+    if (plan->operation != CMD_SELECT)
+        return;
+
+    /* Absorb the remaining result */
+    absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *        Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1700,7 +1876,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
     user = GetUserMapping(userid, table->serverid);
 
     /* Open connection; report that we'll create a prepared statement. */
-    fmstate->conn = GetConnection(user, true);
+    fmstate->s.conn = GetConnection(user, true);
+    fmstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
     fmstate->p_name = NULL;        /* prepared statement not made yet */
 
     /* Deconstruct fdw_private data. */
@@ -1779,6 +1957,8 @@ postgresExecForeignInsert(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    vacate_connection((PgFdwState *)fmstate);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1789,14 +1969,14 @@ postgresExecForeignInsert(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1804,10 +1984,10 @@ postgresExecForeignInsert(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -1845,6 +2025,8 @@ postgresExecForeignUpdate(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    vacate_connection((PgFdwState *)fmstate);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1865,14 +2047,14 @@ postgresExecForeignUpdate(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1880,10 +2062,10 @@ postgresExecForeignUpdate(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -1921,6 +2103,8 @@ postgresExecForeignDelete(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    vacate_connection((PgFdwState *)fmstate);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1941,14 +2125,14 @@ postgresExecForeignDelete(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1956,10 +2140,10 @@ postgresExecForeignDelete(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -2006,16 +2190,16 @@ postgresEndForeignModify(EState *estate,
          * We don't use a PG_TRY block here, so be careful not to throw error
          * without releasing the PGresult.
          */
-        res = pgfdw_exec_query(fmstate->conn, sql);
+        res = pgfdw_exec_query(fmstate->s.conn, sql);
         if (PQresultStatus(res) != PGRES_COMMAND_OK)
-            pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+            pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
         PQclear(res);
         fmstate->p_name = NULL;
     }
 
     /* Release remote connection */
-    ReleaseConnection(fmstate->conn);
-    fmstate->conn = NULL;
+    ReleaseConnection(fmstate->s.conn);
+    fmstate->s.conn = NULL;
 }
 
 /*
@@ -2303,7 +2487,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    dmstate->conn = GetConnection(user, false);
+    dmstate->s.conn = GetConnection(user, false);
+    dmstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
     /* Initialize state variable */
     dmstate->num_tuples = -1;    /* -1 means not set yet */
@@ -2356,7 +2542,10 @@ postgresIterateDirectModify(ForeignScanState *node)
      * If this is the first call after Begin, execute the statement.
      */
     if (dmstate->num_tuples == -1)
+    {
+        vacate_connection((PgFdwState *)dmstate);
         execute_dml_stmt(node);
+    }
 
     /*
      * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2403,8 +2592,8 @@ postgresEndDirectModify(ForeignScanState *node)
         PQclear(dmstate->result);
 
     /* Release remote connection */
-    ReleaseConnection(dmstate->conn);
-    dmstate->conn = NULL;
+    ReleaseConnection(dmstate->s.conn);
+    dmstate->s.conn = NULL;
 
     /* MemoryContext will be deleted automatically. */
 }
@@ -2523,6 +2712,7 @@ estimate_path_cost_size(PlannerInfo *root,
         List       *local_param_join_conds;
         StringInfoData sql;
         PGconn       *conn;
+        PgFdwConnpriv *connpriv;
         Selectivity local_sel;
         QualCost    local_cost;
         List       *fdw_scan_tlist = NIL;
@@ -2565,6 +2755,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
         /* Get the remote estimate */
         conn = GetConnection(fpinfo->user, false);
+        connpriv = GetConnectionSpecificStorage(fpinfo->user,
+                                                sizeof(PgFdwConnpriv));
+        if (connpriv)
+        {
+            PgFdwState tmpstate;
+            tmpstate.conn = conn;
+            tmpstate.connpriv = connpriv;
+            vacate_connection(&tmpstate);
+        }
+
         get_remote_estimate(sql.data, conn, &rows, &width,
                             &startup_cost, &total_cost);
         ReleaseConnection(conn);
@@ -2919,11 +3119,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     ExprContext *econtext = node->ss.ps.ps_ExprContext;
     int            numParams = fsstate->numParams;
     const char **values = fsstate->param_values;
-    PGconn       *conn = fsstate->conn;
+    PGconn       *conn = fsstate->s.conn;
     StringInfoData buf;
     PGresult   *res;
 
@@ -2989,47 +3189,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    PGconn       *conn = fsstate->s.conn;
+    char        sql[64];
+
+    /* The connection should be vacant */
+    Assert(fsstate->s.connpriv->current_owner == NULL);
+
+    /*
+     * If this is the first call after Begin or ReScan, we need to create the
+     * cursor on the remote side.
+     */
+    if (!fsstate->cursor_exists)
+        create_cursor(node);
+
+    snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+             fsstate->fetch_size, fsstate->cursor_number);
+
+    if (!PQsendQuery(conn, sql))
+        pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
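+    /* This node now owns the connection until its result is consumed */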
+    fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the rows returned for the fetch previously sent by request_more_data().
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     PGresult   *volatile res = NULL;
     MemoryContext oldcontext;
 
+    /* I should be the current connection owner */
+    Assert(fsstate->s.connpriv->current_owner == node);
+
     /*
      * We'll store the tuples in the batch_cxt.  First, flush the previous
-     * batch.
+     * batch if no tuples remain.
      */
-    fsstate->tuples = NULL;
-    MemoryContextReset(fsstate->batch_cxt);
+    if (fsstate->next_tuple >= fsstate->num_tuples)
+    {
+        fsstate->tuples = NULL;
+        fsstate->num_tuples = 0;
+        MemoryContextReset(fsstate->batch_cxt);
+    }
+    else if (fsstate->next_tuple > 0)
+    {
+        /* move the remaining tuples to the beginning of the store */
+        int n = 0;
+
+        while (fsstate->next_tuple < fsstate->num_tuples)
+            fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+        fsstate->num_tuples = n;
+    }
+
     oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
     /* PGresult must be released before leaving this function. */
     PG_TRY();
     {
-        PGconn       *conn = fsstate->conn;
+        PGconn       *conn = fsstate->s.conn;
         char        sql[64];
-        int            numrows;
+        int            addrows;
+        size_t        newsize;
         int            i;
 
         snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
                  fsstate->fetch_size, fsstate->cursor_number);
 
-        res = pgfdw_exec_query(conn, sql);
+        res = pgfdw_get_result(conn, sql);
         /* On error, report the original query, not the FETCH. */
         if (PQresultStatus(res) != PGRES_TUPLES_OK)
             pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
         /* Convert the data into HeapTuples */
-        numrows = PQntuples(res);
-        fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-        fsstate->num_tuples = numrows;
-        fsstate->next_tuple = 0;
+        addrows = PQntuples(res);
+        newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+        if (fsstate->tuples)
+            fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+        else
+            fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-        for (i = 0; i < numrows; i++)
+        for (i = 0; i < addrows; i++)
         {
             Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-            fsstate->tuples[i] =
+            fsstate->tuples[fsstate->num_tuples + i] =
                 make_tuple_from_result_row(res, i,
                                            fsstate->rel,
                                            fsstate->attinmeta,
@@ -3039,27 +3288,82 @@ fetch_more_data(ForeignScanState *node)
         }
 
         /* Update fetch_ct_2 */
-        if (fsstate->fetch_ct_2 < 2)
+        if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
             fsstate->fetch_ct_2++;
 
+        fsstate->next_tuple = 0;
+        fsstate->num_tuples += addrows;
+
         /* Must be EOF if we didn't get as many tuples as we asked for. */
-        fsstate->eof_reached = (numrows < fsstate->fetch_size);
+        fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
         PQclear(res);
         res = NULL;
     }
     PG_CATCH();
     {
+        fsstate->s.connpriv->current_owner = NULL;
         if (res)
             PQclear(res);
         PG_RE_THROW();
     }
     PG_END_TRY();
 
+    fsstate->s.connpriv->current_owner = NULL;
+
     MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+    PgFdwConnpriv *connpriv = fdwstate->connpriv;
+    ForeignScanState *owner;
+
+    if (connpriv == NULL || connpriv->current_owner == NULL)
+        return;
+
+    /*
+     * let the current connection owner read the result for the running query
+     */
+    owner = connpriv->current_owner;
+    fetch_received_data(owner);
+
+    /* Clear the waiting list */
+    while (owner)
+    {
+        PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+        fsstate->last_waiter = NULL;
+        owner = fsstate->waiter;
+        fsstate->waiter = NULL;
+    }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+    if (owner)
+    {
+        PgFdwScanState *target_state = GetPgFdwScanState(owner);
+        PGconn *conn = target_state->s.conn;
+
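+        /* Read and discard any remaining results of the running query */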
+        while (PQisBusy(conn))
+            PQclear(PQgetResult(conn));
+        fsstate->s.connpriv->current_owner = NULL;
+        fsstate->async_waiting = false;
+    }
+}
+
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3143,7 +3447,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
     /* Construct name we'll use for the prepared statement. */
     snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-             GetPrepStmtNumber(fmstate->conn));
+             GetPrepStmtNumber(fmstate->s.conn));
     p_name = pstrdup(prep_name);
 
     /*
@@ -3153,12 +3457,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * the prepared statements we use in this module are simple enough that
      * the remote server will make the right choices.
      */
-    if (!PQsendPrepare(fmstate->conn,
+    if (!PQsendPrepare(fmstate->s.conn,
                        p_name,
                        fmstate->query,
                        0,
                        NULL))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3166,9 +3470,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
     PQclear(res);
 
     /* This action shows that the prepare has been done. */
@@ -3299,9 +3603,9 @@ execute_dml_stmt(ForeignScanState *node)
      * the desired result.  This allows us to avoid assuming that the remote
      * server has the same OIDs we do for the parameters' types.
      */
-    if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+    if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
                            NULL, values, NULL, NULL, 0))
-        pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+        pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3309,10 +3613,10 @@ execute_dml_stmt(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+    dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
     if (PQresultStatus(dmstate->result) !=
         (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+        pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
                            dmstate->query);
 
     /* Get the number of rows affected. */
@@ -4582,6 +4886,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
     /* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+    return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when the node is the connection owner.  Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                                  void *caller_data, bool reinit)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+    /* If the caller didn't reinit, this event is already in the event set */
+    if (!reinit)
+        return true;
+
+    if (fsstate->s.connpriv->current_owner == node)
+    {
+        AddWaitEventToSet(wes,
+                          WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+                          NULL, caller_data);
+        return true;
+    }
+
+    return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
@@ -4946,7 +5286,7 @@ make_tuple_from_result_row(PGresult *res,
         PgFdwScanState *fdw_sstate;
 
         Assert(fsstate);
-        fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+        fdw_sstate = GetPgFdwScanState(fsstate);
         tupdesc = fdw_sstate->tupdesc;
     }
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 1ae809d..58ef26e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
     UserMapping *user;            /* only set in use_remote_estimate mode */
 
     int            fetch_size;        /* fetch size for this remote table */
+    bool        allow_prefetch;    /* true to allow overlapped fetching */
 
     /*
      * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 3c3c5c7..cb9caa5 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1535,25 +1535,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1589,12 +1589,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1653,8 +1653,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.9.2
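
For FDW authors, the contract implied by the patch's two new callbacks is
small: report whether a path may run asynchronously, and register a wait
event only while the node owns the connection.  The following is only a
minimal sketch of that pattern, modeled on the postgres_fdw code above
(MyFdwScanState and my_connection_owner() are illustrative placeholders,
not part of the patch):

static bool
myIsForeignPathAsyncCapable(ForeignPath *path)
{
    /* Claim that any path generated by this FDW may run asynchronously. */
    return true;
}

static bool
myForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
                            void *caller_data, bool reinit)
{
    MyFdwScanState *fsstate = (MyFdwScanState *) node->fdw_state;

    /* If the caller did not rebuild the set, our event is still in it. */
    if (!reinit)
        return true;

    /* Register a socket-readable event only if we own the connection. */
    if (my_connection_owner(fsstate) == node)
    {
        AddWaitEventToSet(wes, WL_SOCKET_READABLE, PQsocket(fsstate->conn),
                          NULL, caller_data);
        return true;
    }

    return false;
}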


Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
At Thu, 11 Jan 2018 17:08:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180111.170839.23674040.horiguchi.kyotaro@lab.ntt.co.jp>
> At Mon, 11 Dec 2017 20:07:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
> in <20171211.200753.191768178.horiguchi.kyotaro@lab.ntt.co.jp>
 
> > > The attached PoC patch theoretically has no impact on the normal
> > > code paths and just brings gain in async cases.
> > 
> > The parallel append just committed hit this, and the attached is
> > the version rebased onto the current HEAD. The result of a brief
> > performance test follows.
> > 
> >                            patched(ms)  unpatched(ms)   gain(%)
> > A: simple table scan     :  3562.32      3444.81        -3.4
> > B: local partitioning    :  1451.25      1604.38         9.5
> > C: single remote table   :  8818.92      9297.76         5.1
> > D: sharding (single con) :  5966.14      6646.73        10.2
> > E: sharding (multi con)  :  1802.25      6515.49        72.3
> > 
> > > A and B are degradation checks, which are expected to show no
> > > degradation.  C is the gain from postgres_fdw's command
> > > pre-sending alone on a remote table.  D is the gain of sharding on
> > > a single connection. The number of partitions/shards is 4.  E is
> > > the gain using a dedicated connection per shard.
> > 
> > Test A is accelerated by parallel sequential scan, and introducing
> > parallel append accelerates test B. Comparing A and B, I doubt that
> > the degradation is stably measurable, at least in my environment,
> > but I believe there is no degradation theoretically. Tests C to E
> > still show a clear gain.
> > regards,
> 
> The patch conflicts with 3cac0ec. This is the rebased version.

The patch itself had not been workable for a long time.

- Rebased to current master.
  (Removed some wrongly-inserted lines)
- Fixed a wrongly-positioned assertion in postgres_fdw.c
  (Caused an assertion failure in normal usage)
- Properly reset a persistent (static) variable.
  (Caused a SEGV under certain conditions)
- Fixed the EXPLAIN output of an async-mixed append plan.
  (Chooses the proper subnode as the referent node)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 6ab58d3fb02f716deaa207824747646dd8c2a448 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

In certain cases a WaitEventSet needs to be released via a resource
owner. This change adds a resource owner member to WaitEventSet and
allows the creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index a4f6d4d..890972b 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
                 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
     AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
                       NULL, NULL);
     AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index e6706f7..5457899 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
     int            nevents;        /* number of registered events */
     int            nevents_space;    /* maximum number of events in this set */
 
+    ResourceOwner    resowner;    /* Resource owner */
+
     /*
      * Array, of nevents_space length, storing the definition of events this
      * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
     int            ret = 0;
     int            rc;
     WaitEvent    event;
-    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
     if (wakeEvents & WL_TIMEOUT)
         Assert(timeout >= 0);
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
     WaitEventSet *set;
     char       *data;
     Size        sz = 0;
 
+    if (res)
+        ResourceOwnerEnlargeWESs(res);
+
     /*
      * Use MAXALIGN size/alignment to guarantee that later uses of memory are
      * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
     StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+    /* Register this wait event set if requested */
+    set->resowner = res;
+    if (res)
+        ResourceOwnerRememberWES(set->resowner, set);
+
     return set;
 }
 
@@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set)
     }
 #endif
 
+    if (set->resowner != NULL)
+        ResourceOwnerForgetWES(set->resowner, set);
+
     pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index ef1d5ba..30edc8e 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -69,7 +69,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     {
         WaitEventSet *new_event_set;
 
-        new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+        new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
         AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
                           MyLatch, NULL);
         AddWaitEventToSet(new_event_set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index e09a4f1..7ae8777 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
     ResourceArray snapshotarr;    /* snapshot references */
     ResourceArray filearr;        /* open temporary files */
     ResourceArray dsmarr;        /* dynamic shmem segments */
+    ResourceArray wesarr;        /* wait event sets */
 
     /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
     int            nlocks;            /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
     ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
     ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
     ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+    ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
     return owner;
 }
@@ -538,6 +541,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
                 PrintDSMLeakWarning(res);
             dsm_detach(res);
         }
+
+        /* Ditto for wait event sets */
+        while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+        {
+            WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+            if (isCommit)
+                PrintWESLeakWarning(event);
+            FreeWaitEventSet(event);
+        }
     }
     else if (phase == RESOURCE_RELEASE_LOCKS)
     {
@@ -685,6 +698,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     Assert(owner->snapshotarr.nitems == 0);
     Assert(owner->filearr.nitems == 0);
     Assert(owner->dsmarr.nitems == 0);
+    Assert(owner->wesarr.nitems == 0);
     Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
     /*
@@ -711,6 +725,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     ResourceArrayFree(&(owner->snapshotarr));
     ResourceArrayFree(&(owner->filearr));
     ResourceArrayFree(&(owner->dsmarr));
+    ResourceArrayFree(&(owner->wesarr));
 
     pfree(owner);
 }
@@ -1253,3 +1268,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
     elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
          dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+    ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+    ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identifier of a wait event
+     * set, so use its pointer instead.
+     */
+    if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+        elog(ERROR, "wait event set %p is not owned by resource owner %s",
+             events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identifier of a wait event
+     * set, so use its pointer instead.
+     */
+    elog(WARNING, "wait event set leak: %p still referenced",
+         events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index a4bcb48..838845a 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+                                        ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
                   Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 22b377c..56f2059 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
                        dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+                         WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+                       WaitEventSet *);
+
 #endif                            /* RESOWNER_PRIVATE_H */
-- 
2.9.2
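
To make the new interface concrete: a WaitEventSet created with a resource
owner is freed automatically when that owner is released, so an error
thrown while waiting no longer leaks the set.  A minimal usage sketch,
mirroring the call in execAsync.c of the next patch (nevents, sock,
user_data and timeout are placeholders; error handling is omitted):

{
    WaitEventSet *wes;
    WaitEvent    occurred[16];

    /* Register the new set with the transaction's resource owner. */
    wes = CreateWaitEventSet(TopTransactionContext,
                             TopTransactionResourceOwner, nevents);
    AddWaitEventToSet(wes, WL_SOCKET_READABLE, sock, NULL, user_data);

    (void) WaitEventSetWait(wes, timeout, occurred, lengthof(occurred),
                            WAIT_EVENT_ASYNC_WAIT);

    /*
     * Freeing the set also unregisters it from its resource owner.  If an
     * error is thrown before this point, resource-owner cleanup calls
     * FreeWaitEventSet() for us and prints a leak warning on commit.
     */
    FreeWaitEventSet(wes);
}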

From 60c663b3059e10302a71023eccb275da51331b39 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:23:51 +0900
Subject: [PATCH 2/3] core side modification

---
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/execAsync.c        | 145 ++++++++++++++++++++
 src/backend/executor/nodeAppend.c       | 228 ++++++++++++++++++++++++++++----
 src/backend/executor/nodeForeignscan.c  |  22 ++-
 src/backend/optimizer/plan/createplan.c |  62 ++++++++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/utils/adt/ruleutils.c       |   8 +-
 src/include/executor/execAsync.h        |  23 ++++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 ++
 src/include/nodes/execnodes.h           |  18 ++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 14 files changed, 489 insertions(+), 42 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895..8ad2adf 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execPartition.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..db477e2
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,145 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *      Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *      src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+void
+ExecAsyncSetState(PlanState *pstate, AsyncState status)
+{
+    pstate->asyncstate = status;
+}
+
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                       void *data, bool reinit)
+{
+    switch (nodeTag(node))
+    {
+    case T_ForeignScanState:
+        return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+                                             wes, data, reinit);
+        break;
+    default:
+        elog(ERROR, "unrecognized node type: %d",
+             (int) nodeTag(node));
+    }
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+    int **p_refind;
+    int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+    /* arg is the address of the variable refind in ExecAsyncEventWait */
+    ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+    *mcbarg->p_refind = NULL;
+    *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+    static int *refind = NULL;
+    static int refindsize = 0;
+    WaitEventSet *wes;
+    WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+    int noccurred = 0;
+    Bitmapset *fired_events = NULL;
+    int i;
+    int n;
+
+    n = bms_num_members(waitnodes);
+    wes = CreateWaitEventSet(TopTransactionContext,
+                             TopTransactionResourceOwner, n);
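+    /* Enlarge refind[], which maps fired events back to waitnodes indexes */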
+    if (refindsize < n)
+    {
+        if (refindsize == 0)
+            refindsize = EVENT_BUFFER_SIZE; /* XXX */
+        while (refindsize < n)
+            refindsize *= 2;
+        if (refind)
+            refind = (int *) repalloc(refind, refindsize * sizeof(int));
+        else
+        {
+            static ExecAsync_mcbarg mcb_arg =
+                { &refind, &refindsize };
+            static MemoryContextCallback mcb =
+                { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+            MemoryContext oldctxt =
+                MemoryContextSwitchTo(TopTransactionContext);
+
+            /*
+             * refind points to a memory block in
+             * TopTransactionContext. Register a callback to reset it.
+             */
+            MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+            refind = (int *) palloc(refindsize * sizeof(int));
+            MemoryContextSwitchTo(oldctxt);
+        }
+    }
+
+    n = 0;
+    for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+         i = bms_next_member(waitnodes, i))
+    {
+        refind[i] = i;
+        if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+            n++;
+    }
+
+    if (n == 0)
+    {
+        FreeWaitEventSet(wes);
+        return NULL;
+    }
+
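+    /* Wait (or just poll, if timeout is 0) for the registered events */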
+    noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+                                 EVENT_BUFFER_SIZE,
+                                 WAIT_EVENT_ASYNC_WAIT);
+    FreeWaitEventSet(wes);
+    if (noccurred == 0)
+        return NULL;
+
+    for (i = 0 ; i < noccurred ; i++)
+    {
+        WaitEvent *w = &occurred_event[i];
+
+        if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+        {
+            int n = *(int*)w->user_data;
+
+            fired_events = bms_add_member(fired_events, n);
+        }
+    }
+
+    return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 7a3dd2e..df1f7ae 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -59,6 +59,7 @@
 
 #include "executor/execdebug.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -79,6 +80,7 @@ struct ParallelAppendState
 #define INVALID_SUBPLAN_INDEX        -1
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -104,7 +106,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     ListCell   *lc;
 
     /* check for unsupported flags */
-    Assert(!(eflags & EXEC_FLAG_MARK));
+    Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
     /*
      * Lock the non-leaf tables in the partition tree controlled by this node.
@@ -127,6 +129,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     appendstate->ps.ExecProcNode = ExecAppend;
     appendstate->appendplans = appendplanstates;
     appendstate->as_nplans = nplans;
+    appendstate->as_nasyncplans = node->nasyncplans;
+    appendstate->as_syncdone = (node->nasyncplans == nplans);
+    appendstate->as_asyncresult = (TupleTableSlot **)
+        palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+    /* Choose async version of Exec function */
+    if (appendstate->as_nasyncplans > 0)
+        appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+    /* initially, every async subplan needs a request */
+    for (i = 0; i < appendstate->as_nasyncplans; ++i)
+        appendstate->as_needrequest =
+            bms_add_member(appendstate->as_needrequest, i);
 
     /*
      * Initialize result tuple type and slot.
@@ -141,11 +156,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     foreach(lc, node->appendplans)
     {
         Plan       *initNode = (Plan *) lfirst(lc);
+        int            sub_eflags = eflags;
 
-        appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
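+        /* Async-capable subplans are placed before sync ones in appendplans */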
+        if (i < appendstate->as_nasyncplans)
+            sub_eflags |= EXEC_FLAG_ASYNC;
+
+        appendplanstates[i] = ExecInitNode(initNode, estate, sub_eflags);
         i++;
     }
 
     /*
      * Miscellaneous initialization
      *
@@ -159,8 +182,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
      * looking at shared memory, but non-parallel-aware append plans can
      * always start with the first subplan.
      */
-    appendstate->as_whichplan =
-        appendstate->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0;
+    if (appendstate->ps.plan->parallel_aware)
+        appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
+    else if (appendstate->as_nasyncplans > 0)
+        appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
+    else
+        appendstate->as_whichsyncplan = 0;
 
     /* If parallel-aware, this will be overridden later. */
     appendstate->choose_next_subplan = choose_next_subplan_locally;
@@ -180,10 +207,12 @@ ExecAppend(PlanState *pstate)
     AppendState *node = castNode(AppendState, pstate);
 
     /* If no subplan has been chosen, we must choose one before proceeding. */
-    if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+    if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
         !node->choose_next_subplan(node))
         return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 
+    Assert(node->as_nasyncplans == 0);
+
     for (;;)
     {
         PlanState  *subnode;
@@ -194,8 +223,9 @@ ExecAppend(PlanState *pstate)
         /*
          * figure out which subplan we are currently processing
          */
-        Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-        subnode = node->appendplans[node->as_whichplan];
+        Assert(node->as_whichsyncplan >= 0 &&
+               node->as_whichsyncplan < node->as_nplans);
+        subnode = node->appendplans[node->as_whichsyncplan];
 
         /*
          * get a tuple from the subplan
@@ -218,6 +248,137 @@ ExecAppend(PlanState *pstate)
     }
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+    AppendState *node = castNode(AppendState, pstate);
+    Bitmapset *needrequest;
+    int    i;
+
+    Assert(node->as_nasyncplans > 0);
+
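+    /* Return a tuple already received from an async subplan, if any */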
+    if (node->as_nasyncresult > 0)
+    {
+        --node->as_nasyncresult;
+        return node->as_asyncresult[node->as_nasyncresult];
+    }
+
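+    /* Dispatch a new request to each async subplan that needs one */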
+    needrequest = node->as_needrequest;
+    node->as_needrequest = NULL;
+    while ((i = bms_first_member(needrequest)) >= 0)
+    {
+        TupleTableSlot *slot;
+        PlanState *subnode = node->appendplans[i];
+
+        slot = ExecProcNode(subnode);
+        if (subnode->asyncstate == AS_AVAILABLE)
+        {
+            if (!TupIsNull(slot))
+            {
+                node->as_asyncresult[node->as_nasyncresult++] = slot;
+                node->as_needrequest = bms_add_member(node->as_needrequest, i);
+            }
+        }
+        else
+            node->as_pending_async = bms_add_member(node->as_pending_async, i);
+    }
+    bms_free(needrequest);
+
+    for (;;)
+    {
+        TupleTableSlot *result;
+
+        /* return now if a result is available */
+        if (node->as_nasyncresult > 0)
+        {
+            --node->as_nasyncresult;
+            return node->as_asyncresult[node->as_nasyncresult];
+        }
+
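+        /* Collect results from async subplans whose requests are in flight */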
+        while (!bms_is_empty(node->as_pending_async))
+        {
+            long timeout = node->as_syncdone ? -1 : 0;
+            Bitmapset *fired;
+            int i;
+
+            fired = ExecAsyncEventWait(node->appendplans, node->as_pending_async,
+                                       timeout);
+            while ((i = bms_first_member(fired)) >= 0)
+            {
+                TupleTableSlot *slot;
+                PlanState *subnode = node->appendplans[i];
+                slot = ExecProcNode(subnode);
+                if (subnode->asyncstate == AS_AVAILABLE)
+                {
+                    if (!TupIsNull(slot))
+                    {
+                        node->as_asyncresult[node->as_nasyncresult++] = slot;
+                        node->as_needrequest =
+                            bms_add_member(node->as_needrequest, i);
+                    }
+                    node->as_pending_async =
+                        bms_del_member(node->as_pending_async, i);
+                }
+            }
+            bms_free(fired);
+
+            /* return now if a result is available */
+            if (node->as_nasyncresult > 0)
+            {
+                --node->as_nasyncresult;
+                return node->as_asyncresult[node->as_nasyncresult];
+            }
+
+            if (!node->as_syncdone)
+                break;
+        }
+
+        /*
+         * If there is no asynchronous activity still pending and the
+         * synchronous activity is also complete, we're totally done scanning
+         * this node.  Otherwise, we're done with the asynchronous stuff but
+         * must continue scanning the synchronous children.
+         */
+        if (node->as_syncdone)
+        {
+            Assert(bms_is_empty(node->as_pending_async));
+            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+        }
+
+        /*
+         * get a tuple from the subplan
+         */
+        result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+        if (!TupIsNull(result))
+        {
+            /*
+             * If the subplan gave us something then return it as-is. We do
+             * NOT make use of the result slot that was set up in
+             * ExecInitAppend; there's no need for it.
+             */
+            return result;
+        }
+
+        /*
+         * Go on to the "next" subplan in the appropriate direction. If no
+         * more subplans, return the empty slot set up for us by
+         * ExecInitAppend, unless there are async plans we have yet to finish.
+         */
+        if (!node->choose_next_subplan(node))
+        {
+            node->as_syncdone = true;
+            if (bms_is_empty(node->as_pending_async))
+            {
+                Assert(bms_is_empty(node->as_needrequest));
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+            }
+        }
+
+        /* Else loop back and try to get a tuple from the new subplan */
+    }
+}
+
 /* ----------------------------------------------------------------
  *        ExecEndAppend
  *
@@ -251,6 +412,15 @@ ExecReScanAppend(AppendState *node)
 {
     int            i;
 
+    /* Reset async state. */
+    for (i = 0; i < node->as_nasyncplans; ++i)
+    {
+        ExecShutdownNode(node->appendplans[i]);
+        node->as_needrequest = bms_add_member(node->as_needrequest, i);
+    }
+    node->as_nasyncresult = 0;
+    node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
     for (i = 0; i < node->as_nplans; i++)
     {
         PlanState  *subnode = node->appendplans[i];
@@ -270,7 +440,7 @@ ExecReScanAppend(AppendState *node)
             ExecReScan(subnode);
     }
 
-    node->as_whichplan =
+    node->as_whichsyncplan =
         node->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0;
 }
 
@@ -359,7 +529,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-    int            whichplan = node->as_whichplan;
+    int            whichplan = node->as_whichsyncplan;
 
     /* We should never see INVALID_SUBPLAN_INDEX in this case. */
     Assert(whichplan >= 0 && whichplan <= node->as_nplans);
@@ -368,13 +538,13 @@ choose_next_subplan_locally(AppendState *node)
     {
         if (whichplan >= node->as_nplans - 1)
             return false;
-        node->as_whichplan++;
+        node->as_whichsyncplan++;
     }
     else
     {
         if (whichplan <= 0)
             return false;
-        node->as_whichplan--;
+        node->as_whichsyncplan--;
     }
 
     return true;
@@ -399,33 +569,33 @@ choose_next_subplan_for_leader(AppendState *node)
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
     {
         /* Mark just-completed subplan as finished. */
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
     }
     else
     {
         /* Start with last subplan. */
-        node->as_whichplan = node->as_nplans - 1;
+        node->as_whichsyncplan = node->as_nplans - 1;
     }
 
     /* Loop until we find a subplan to execute. */
-    while (pstate->pa_finished[node->as_whichplan])
+    while (pstate->pa_finished[node->as_whichsyncplan])
     {
-        if (node->as_whichplan == 0)
+        if (node->as_whichsyncplan == 0)
         {
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-            node->as_whichplan = INVALID_SUBPLAN_INDEX;
+            node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
             LWLockRelease(&pstate->pa_lock);
             return false;
         }
-        node->as_whichplan--;
+        node->as_whichsyncplan--;
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < append->first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < append->first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -457,8 +627,8 @@ choose_next_subplan_for_worker(AppendState *node)
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
     /* Mark just-completed subplan as finished. */
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     /* If all the plans are already done, we have nothing to do */
     if (pstate->pa_next_plan == INVALID_SUBPLAN_INDEX)
@@ -468,7 +638,7 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Save the plan from which we are starting the search. */
-    node->as_whichplan = pstate->pa_next_plan;
+    node->as_whichsyncplan = pstate->pa_next_plan;
 
     /* Loop until we find a subplan to execute. */
     while (pstate->pa_finished[pstate->pa_next_plan])
@@ -478,7 +648,7 @@ choose_next_subplan_for_worker(AppendState *node)
             /* Advance to next plan. */
             pstate->pa_next_plan++;
         }
-        else if (node->as_whichplan > append->first_partial_plan)
+        else if (node->as_whichsyncplan > append->first_partial_plan)
         {
             /* Loop back to first partial plan. */
             pstate->pa_next_plan = append->first_partial_plan;
@@ -489,10 +659,10 @@ choose_next_subplan_for_worker(AppendState *node)
              * At last plan, and either there are no partial plans or we've
              * tried them all.  Arrange to bail out.
              */
-            pstate->pa_next_plan = node->as_whichplan;
+            pstate->pa_next_plan = node->as_whichsyncplan;
         }
 
-        if (pstate->pa_next_plan == node->as_whichplan)
+        if (pstate->pa_next_plan == node->as_whichsyncplan)
         {
             /* We've tried everything! */
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -502,7 +672,7 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Pick the plan we found, and advance pa_next_plan one more time. */
-    node->as_whichplan = pstate->pa_next_plan++;
+    node->as_whichsyncplan = pstate->pa_next_plan++;
     if (pstate->pa_next_plan >= node->as_nplans)
     {
         if (append->first_partial_plan < node->as_nplans)
@@ -518,8 +688,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < append->first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < append->first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 0084234..7da1ac5 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate)
                     (ExecScanRecheckMtd) ForeignRecheck);
 }
 
-
 /* ----------------------------------------------------------------
  *        ExecInitForeignScan
  * ----------------------------------------------------------------
@@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
     scanstate->ss.ps.plan = (Plan *) node;
     scanstate->ss.ps.state = estate;
     scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+    scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+    if ((eflags & EXEC_FLAG_ASYNC) != 0)
+        scanstate->fs_async = true;
 
     /*
      * Miscellaneous initialization
@@ -383,3 +386,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
     if (fdwroutine->ShutdownForeignScan)
         fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *        ExecForeignAsyncConfigureWait
+ *
+ *        In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                              void *caller_data, bool reinit)
+{
+    FdwRoutine *fdwroutine = node->fdwroutine;
+
+    Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+    return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+                                                 caller_data, reinit);
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index da0cc7f..24c838d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -204,7 +204,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
                    Index scanrelid, int wtParam);
 static Append *make_append(List *appendplans, int first_partial_plan,
-            List *tlist, List *partitioned_rels);
+                           int nasyncplans, int referent,
+                           List *tlist, List *partitioned_rels);
 static RecursiveUnion *make_recursive_union(List *tlist,
                      Plan *lefttree,
                      Plan *righttree,
@@ -287,6 +288,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
                  List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
                          GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -1020,8 +1022,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
     Append       *plan;
     List       *tlist = build_path_tlist(root, &best_path->path);
-    List       *subplans = NIL;
+    List       *asyncplans = NIL;
+    List       *syncplans = NIL;
     ListCell   *subpaths;
+    int            nasyncplans = 0;
+    bool        first = true;
+    bool        referent_is_sync = true;
 
     /*
      * The subpaths list could be empty, if every child was proven empty by
@@ -1056,7 +1062,21 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
         /* Must insist that all children return the same tlist */
         subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-        subplans = lappend(subplans, subplan);
+        /*
+         * Classify as async-capable or not.  If we have decided to run the
+         * children in parallel, none of them can run asynchronously.
+         */
+        if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+        {
+            asyncplans = lappend(asyncplans, subplan);
+            ++nasyncplans;
+            if (first)
+                referent_is_sync = false;
+        }
+        else
+            syncplans = lappend(syncplans, subplan);
+
+        first = false;
     }
 
     /*
@@ -1066,8 +1086,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
      * parent-rel Vars it'll be asked to emit.
      */
 
-    plan = make_append(subplans, best_path->first_partial_path,
-                       tlist, best_path->partitioned_rels);
+    plan = make_append(list_concat(asyncplans, syncplans),
+                       best_path->first_partial_path, nasyncplans,
+                       referent_is_sync ? nasyncplans : 0, tlist,
+                       best_path->partitioned_rels);
 
     copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5319,8 +5341,8 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int first_partial_plan,
-            List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int first_partial_plan, int nasyncplans,
+            int referent, List *tlist, List *partitioned_rels)
 {
     Append       *node = makeNode(Append);
     Plan       *plan = &node->plan;
@@ -5332,6 +5354,8 @@ make_append(List *appendplans, int first_partial_plan,
     node->partitioned_rels = partitioned_rels;
     node->appendplans = appendplans;
     node->first_partial_plan = first_partial_plan;
+    node->nasyncplans = nasyncplans;
+    node->referent = referent;
 
     return node;
 }
@@ -6677,3 +6701,27 @@ is_projection_capable_plan(Plan *plan)
     }
     return true;
 }
+
+/*
+ * is_async_capable_path
+ *        Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+    switch (nodeTag(path))
+    {
+        case T_ForeignPath:
+            {
+                FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+                Assert(fdwroutine != NULL);
+                if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+                    fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+                    return true;
+            }
+        default:
+            break;
+    }
+    return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 96ba216..08eac23 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3676,6 +3676,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_SYNC_REP:
             event_name = "SyncRep";
             break;
+        case WAIT_EVENT_ASYNC_WAIT:
+            event_name = "AsyncExecWait";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index ba9fab4..6837642 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4463,7 +4463,7 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
     dpns->planstate = ps;
 
     /*
-     * We special-case Append and MergeAppend to pretend that the first child
+     * We special-case Append and MergeAppend to pretend that a specific child
      * plan is the OUTER referent; we have to interpret OUTER Vars in their
      * tlists according to one of the children, and the first one is the most
      * natural choice.  Likewise special-case ModifyTable to pretend that the
@@ -4471,7 +4471,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
      * lists containing references to non-target relations.
      */
     if (IsA(ps, AppendState))
-        dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+    {
+        AppendState *aps = (AppendState *) ps;
+        Append *app = (Append *) ps->plan;
+        dpns->outer_planstate = aps->appendplans[app->referent];
+    }
     else if (IsA(ps, MergeAppendState))
         dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
     else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..5fd67d9
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *        Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *        src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern void ExecAsyncSetState(PlanState *pstate, AsyncState status);
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                                   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+                                     long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 45a077a..54cc358 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -64,6 +64,7 @@
 #define EXEC_FLAG_WITH_OIDS        0x0020    /* force OIDs in returned tuples */
 #define EXEC_FLAG_WITHOUT_OIDS    0x0040    /* force no OIDs in returned tuples */
 #define EXEC_FLAG_WITH_NO_DATA    0x0080    /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC            0x0100    /* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index ccb66be..67abf8e 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
                                 ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+                                          WaitEventSet *wes,
+                                          void *caller_data, bool reinit);
 
 #endif                            /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e88fee3..beb3f0d 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -161,6 +161,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
                                                             List *fdw_private,
                                                             RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+                                                    WaitEventSet *wes,
+                                                    void *caller_data,
+                                                    bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -182,6 +187,7 @@ typedef struct FdwRoutine
     GetForeignPlan_function GetForeignPlan;
     BeginForeignScan_function BeginForeignScan;
     IterateForeignScan_function IterateForeignScan;
+    IterateForeignScan_function IterateForeignScanAsync;
     ReScanForeignScan_function ReScanForeignScan;
     EndForeignScan_function EndForeignScan;
 
@@ -232,6 +238,11 @@ typedef struct FdwRoutine
     InitializeDSMForeignScan_function InitializeDSMForeignScan;
     ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
     InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+    /* Support functions for asynchronous execution */
+    IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+    ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
     ShutdownForeignScan_function ShutdownForeignScan;
 
     /* Support functions for path reparameterization. */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a953820..c9c3db2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -861,6 +861,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+    AS_AVAILABLE,
+    AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
     NodeTag        type;
@@ -901,6 +907,9 @@ typedef struct PlanState
     TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
     ExprContext *ps_ExprContext;    /* node's expression-evaluation context */
     ProjectionInfo *ps_ProjInfo;    /* info for doing tuple projection */
+
+    AsyncState    asyncstate;        /* async execution status of this node */
+    int32        padding;            /* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1023,10 +1032,16 @@ struct AppendState
     PlanState    ps;                /* its first field is NodeTag */
     PlanState **appendplans;    /* array of PlanStates for my inputs */
     int            as_nplans;
-    int            as_whichplan;
+    int            as_nasyncplans;    /* # of async-capable children */
     ParallelAppendState *as_pstate; /* parallel coordination info */
+    int            as_whichsyncplan; /* which sync plan is being executed  */
     Size        pstate_len;        /* size of parallel coordination info */
     bool        (*choose_next_subplan) (AppendState *);
+    bool        as_syncdone;    /* all synchronous plans done? */
+    Bitmapset  *as_needrequest;    /* async plans needing a new request */
+    Bitmapset  *as_pending_async;    /* pending async plans */
+    TupleTableSlot **as_asyncresult;    /* unreturned results of async plans */
+    int            as_nasyncresult;    /* # of valid entries in as_asyncresult */
 };
 
 /* ----------------
@@ -1577,6 +1592,7 @@ typedef struct ForeignScanState
     Size        pscan_len;        /* size of parallel coordination information */
     /* use struct pointer to avoid including fdwapi.h here */
     struct FdwRoutine *fdwroutine;
+    bool        fs_async;        /* true if this scan runs asynchronously */
     void       *fdw_state;        /* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f2e19ea..64ee18e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -250,6 +250,8 @@ typedef struct Append
     List       *partitioned_rels;
     List       *appendplans;
     int            first_partial_plan;
+    int            nasyncplans;    /* # of async plans, always at start of list */
+    int            referent;         /* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f592..6f4583b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -832,7 +832,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_ORIGIN_DROP,
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
+    WAIT_EVENT_SYNC_REP,
+    WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2

From c2195953a34fe7c0574631e5c118a948263dc755 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c              |  26 ++
 contrib/postgres_fdw/expected/postgres_fdw.out | 128 ++++---
 contrib/postgres_fdw/postgres_fdw.c            | 484 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 522 insertions(+), 138 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 00c926b..4f3d59d 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
     bool        invalidated;    /* true if reconnect is pending */
     uint32        server_hashvalue;    /* hash value of foreign server OID */
     uint32        mapping_hashvalue;    /* hash value of user mapping OID */
+    void        *storage;        /* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
         elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
              entry->conn, server->servername, user->umid, user->userid);
+        entry->storage = NULL;
     }
 
     /*
@@ -216,6 +218,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user, allocating it
+ * with initsize bytes (zeroed) if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+    bool        found;
+    ConnCacheEntry *entry;
+    ConnCacheKey key;
+
+    key = user->umid;
+    entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+    Assert(found);
+
+    if (entry->storage == NULL)
+    {
+        entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+        memset(entry->storage, 0, initsize);
+    }
+
+    return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 262c635..29ba813 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6790,7 +6790,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -6818,7 +6818,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6846,7 +6846,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6874,7 +6874,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -6940,35 +6940,40 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6978,35 +6983,40 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -7036,11 +7046,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -7054,11 +7064,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (39 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -7089,16 +7099,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -7116,16 +7126,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -7276,27 +7286,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 941a2e7..337c728 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+    ForeignScanState *current_owner;    /* The node currently running a query
+                                         * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+    PGconn       *conn;            /* connection for the scan */
+    PgFdwConnpriv *connpriv;    /* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table. NULL
                                  * for a foreign join scan. */
     TupleDesc    tupdesc;        /* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
     List       *retrieved_attrs;    /* list of retrieved attribute numbers */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
+    bool        result_ready;    /* true if a tuple or EOF is ready to report */
     unsigned int cursor_number; /* quasi-unique ID for my cursor */
     bool        cursor_exists;    /* have we created the cursor? */
     int            numParams;        /* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
     /* batch-level state, for optimizing rewinds and avoiding useless fetch */
     int            fetch_ct_2;        /* Min(# of fetches done, 2) */
     bool        eof_reached;    /* true if last fetch reached EOF */
+    bool        run_async;        /* true if run asynchronously */
+    bool        async_waiting;    /* true if requesting the parent to wait */
+    ForeignScanState *waiter;    /* Next node to run a query among nodes
+                                 * sharing the same connection */
+    ForeignScanState *last_waiter;    /* A waiting node at the end of a waiting
+                                 * list. Maintained only by the current
+                                     * owner of the connection */
 
     /* working memory contexts */
     MemoryContext batch_cxt;    /* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     char       *p_name;            /* name of prepared statement, if created */
 
     /* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
@@ -293,6 +324,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
                                 RangeTblEntry *target_rte,
                                 Relation target_relation);
@@ -353,6 +385,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
                              UpperRelationKind stage,
                              RelOptInfo *input_rel,
                              RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+                                              WaitEventSet *wes,
+                                              void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -373,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
                           EquivalenceClass *ec, EquivalenceMember *em,
                           void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -452,6 +491,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     routine->IterateForeignScan = postgresIterateForeignScan;
     routine->ReScanForeignScan = postgresReScanForeignScan;
     routine->EndForeignScan = postgresEndForeignScan;
+    routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
     /* Functions for updating foreign tables */
     routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -486,6 +526,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     /* Support functions for upper relation push-down */
     routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+    /* Support functions for async execution */
+    routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+    routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
     PG_RETURN_POINTER(routine);
 }
 
@@ -1336,12 +1380,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    fsstate->conn = GetConnection(user, false);
+    fsstate->s.conn = GetConnection(user, false);
+    fsstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+    fsstate->s.connpriv->current_owner = NULL;
+    fsstate->waiter = NULL;
+    fsstate->last_waiter = node;
 
     /* Assign a unique ID for my cursor */
-    fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+    fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
     fsstate->cursor_exists = false;
 
+    /* Initialize async execution status */
+    fsstate->run_async = false;
+    fsstate->async_waiting = false;
+
     /* Get private info created by planner functions. */
     fsstate->query = strVal(list_nth(fsplan->fdw_private,
                                      FdwScanPrivateSelectSql));
@@ -1397,32 +1450,136 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
     /*
-     * If this is the first call after Begin or ReScan, we need to create the
-     * cursor on the remote side.
-     */
-    if (!fsstate->cursor_exists)
-        create_cursor(node);
-
-    /*
      * Get some more tuples, if we've run out.
      */
     if (fsstate->next_tuple >= fsstate->num_tuples)
     {
-        /* No point in another fetch if we already detected EOF, though. */
-        if (!fsstate->eof_reached)
-            fetch_more_data(node);
-        /* If we didn't get any tuples, must be end of data. */
-        if (fsstate->next_tuple >= fsstate->num_tuples)
+        ForeignScanState *next_conn_owner = node;
+
+        /* This node has sent a query on this connection */
+        if (fsstate->s.connpriv->current_owner == node)
+        {
+            /* Check if the result is available */
+            if (PQisBusy(fsstate->s.conn))
+            {
+                int rc = WaitLatchOrSocket(NULL,
+                                           WL_SOCKET_READABLE | WL_TIMEOUT,
+                                           PQsocket(fsstate->s.conn), 0,
+                                           WAIT_EVENT_ASYNC_WAIT);
+                if (node->fs_async && !(rc & WL_SOCKET_READABLE))
+                {
+                    /*
+                     * This node is not ready yet. Tell the caller to wait.
+                     */
+                    fsstate->result_ready = false;
+                    node->ss.ps.asyncstate = AS_WAITING;
+                    return ExecClearTuple(slot);
+                }
+            }
+
+            Assert(fsstate->async_waiting);
+            fsstate->async_waiting = false;
+            fetch_received_data(node);
+
+            /*
+             * If other nodes are waiting on the same connection, let the
+             * first waiter become the next owner of the connection.
+             */
+            if (fsstate->waiter)
+            {
+                PgFdwScanState *next_owner_state;
+
+                next_conn_owner = fsstate->waiter;
+                next_owner_state = GetPgFdwScanState(next_conn_owner);
+                fsstate->waiter = NULL;
+
+                /*
+                 * Only the current owner is responsible for maintaining the
+                 * shortcut to the last waiter.
+                 */
+                next_owner_state->last_waiter = fsstate->last_waiter;
+
+                /*
+                 * For simplicity, last_waiter points to the node itself when
+                 * no one is waiting for it.
+                 */
+                fsstate->last_waiter = node;
+            }
+        }
+        else if (fsstate->s.connpriv->current_owner &&
+                 !GetPgFdwScanState(node)->eof_reached)
+        {
+            /*
+             * Another node is holding this connection and this node must
+             * run later. Add this node to the tail of the waiters' list,
+             * then return not-ready.  To avoid scanning through the
+             * waiters' list, the current owner maintains a shortcut to
+             * the last waiter.
+             */
+            PgFdwScanState *conn_owner_state =
+                GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+            ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+            PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+            last_waiter_state->waiter = node;
+            conn_owner_state->last_waiter = node;
+
+            /* Mark the node as waiting for an async result */
+            Assert(!GetPgFdwScanState(node)->async_waiting);
+
+            GetPgFdwScanState(node)->async_waiting = true;
+
+            fsstate->result_ready = fsstate->eof_reached;
+            node->ss.ps.asyncstate =
+                fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
             return ExecClearTuple(slot);
+        }
+
+        /*
+         * Send the next request on behalf of the next owner of this
+         * connection, if needed.
+         */
+        if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+        {
+            PgFdwScanState *next_owner_state =
+                GetPgFdwScanState(next_conn_owner);
+
+            /* No one is running on this connection at this time */
+            Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+                   == NULL);
+            request_more_data(next_conn_owner);
+
+            /* Mark the node as waiting for an async result */
+            if (!next_owner_state->async_waiting)
+                next_owner_state->async_waiting = true;
+
+            if (!next_conn_owner->fs_async)
+                fetch_received_data(next_conn_owner);
+        }
+
+
+        /*
+         * If we still have no tuple for this node at this point, return
+         * an empty slot to give way to other nodes.
+         */
+        if (fsstate->next_tuple >= fsstate->num_tuples)
+        {
+            fsstate->result_ready = fsstate->eof_reached;
+            node->ss.ps.asyncstate =
+                fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
+            return ExecClearTuple(slot);
+        }
     }
 
     /*
      * Return the next tuple.
      */
+    fsstate->result_ready = true;
+    node->ss.ps.asyncstate = AS_AVAILABLE;
     ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
                    slot,
                    InvalidBuffer,
@@ -1438,7 +1595,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     char        sql[64];
     PGresult   *res;
 
@@ -1446,6 +1603,9 @@ postgresReScanForeignScan(ForeignScanState *node)
     if (!fsstate->cursor_exists)
         return;
 
+    /* Absorb the remaining result */
+    absorb_current_result(node);
+
     /*
      * If any internal parameters affecting this node have changed, we'd
      * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1474,9 +1634,9 @@ postgresReScanForeignScan(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_exec_query(fsstate->conn, sql);
+    res = pgfdw_exec_query(fsstate->s.conn, sql);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+        pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
     PQclear(res);
 
     /* Now force a fresh FETCH. */
@@ -1494,7 +1654,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
     /* if fsstate is NULL, we are in EXPLAIN; nothing to do */
     if (fsstate == NULL)
@@ -1502,16 +1662,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
     /* Close the cursor if open, to prevent accumulation of cursors */
     if (fsstate->cursor_exists)
-        close_cursor(fsstate->conn, fsstate->cursor_number);
+        close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
     /* Release remote connection */
-    ReleaseConnection(fsstate->conn);
-    fsstate->conn = NULL;
+    ReleaseConnection(fsstate->s.conn);
+    fsstate->s.conn = NULL;
 
     /* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *        Discard asynchronous state and clean up leftovers on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+    ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+    if (plan->operation != CMD_SELECT)
+        return;
+
+    /* Absorb the remaining result */
+    absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *        Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1714,7 +1890,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
     user = GetUserMapping(userid, table->serverid);
 
     /* Open connection; report that we'll create a prepared statement. */
-    fmstate->conn = GetConnection(user, true);
+    fmstate->s.conn = GetConnection(user, true);
+    fmstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
     fmstate->p_name = NULL;        /* prepared statement not made yet */
 
     /* Deconstruct fdw_private data. */
@@ -1793,6 +1971,8 @@ postgresExecForeignInsert(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    vacate_connection((PgFdwState *)fmstate);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1803,14 +1983,14 @@ postgresExecForeignInsert(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1818,10 +1998,10 @@ postgresExecForeignInsert(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -1859,6 +2039,8 @@ postgresExecForeignUpdate(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    vacate_connection((PgFdwState *)fmstate);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1879,14 +2061,14 @@ postgresExecForeignUpdate(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1894,10 +2076,10 @@ postgresExecForeignUpdate(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -1935,6 +2117,8 @@ postgresExecForeignDelete(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    vacate_connection((PgFdwState *)fmstate);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1955,14 +2139,14 @@ postgresExecForeignDelete(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1970,10 +2154,10 @@ postgresExecForeignDelete(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -2020,16 +2204,16 @@ postgresEndForeignModify(EState *estate,
          * We don't use a PG_TRY block here, so be careful not to throw error
          * without releasing the PGresult.
          */
-        res = pgfdw_exec_query(fmstate->conn, sql);
+        res = pgfdw_exec_query(fmstate->s.conn, sql);
         if (PQresultStatus(res) != PGRES_COMMAND_OK)
-            pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+            pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
         PQclear(res);
         fmstate->p_name = NULL;
     }
 
     /* Release remote connection */
-    ReleaseConnection(fmstate->conn);
-    fmstate->conn = NULL;
+    ReleaseConnection(fmstate->s.conn);
+    fmstate->s.conn = NULL;
 }
 
 /*
@@ -2353,7 +2537,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    dmstate->conn = GetConnection(user, false);
+    dmstate->s.conn = GetConnection(user, false);
+    dmstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
     /* Update the foreign-join-related fields. */
     if (fsplan->scan.scanrelid == 0)
@@ -2438,7 +2624,10 @@ postgresIterateDirectModify(ForeignScanState *node)
      * If this is the first call after Begin, execute the statement.
      */
     if (dmstate->num_tuples == -1)
+    {
+        vacate_connection((PgFdwState *)dmstate);
         execute_dml_stmt(node);
+    }
 
     /*
      * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2485,8 +2674,8 @@ postgresEndDirectModify(ForeignScanState *node)
         PQclear(dmstate->result);
 
     /* Release remote connection */
-    ReleaseConnection(dmstate->conn);
-    dmstate->conn = NULL;
+    ReleaseConnection(dmstate->s.conn);
+    dmstate->s.conn = NULL;
 
     /* close the target relation. */
     if (dmstate->resultRel)
@@ -2609,6 +2798,7 @@ estimate_path_cost_size(PlannerInfo *root,
         List       *local_param_join_conds;
         StringInfoData sql;
         PGconn       *conn;
+        PgFdwConnpriv *connpriv;
         Selectivity local_sel;
         QualCost    local_cost;
         List       *fdw_scan_tlist = NIL;
@@ -2651,6 +2841,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
         /* Get the remote estimate */
         conn = GetConnection(fpinfo->user, false);
+        connpriv = GetConnectionSpecificStorage(fpinfo->user,
+                                                sizeof(PgFdwConnpriv));
+        if (connpriv)
+        {
+            PgFdwState tmpstate;
+            tmpstate.conn = conn;
+            tmpstate.connpriv = connpriv;
+            vacate_connection(&tmpstate);
+        }
+
         get_remote_estimate(sql.data, conn, &rows, &width,
                             &startup_cost, &total_cost);
         ReleaseConnection(conn);
@@ -3005,11 +3205,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     ExprContext *econtext = node->ss.ps.ps_ExprContext;
     int            numParams = fsstate->numParams;
     const char **values = fsstate->param_values;
-    PGconn       *conn = fsstate->conn;
+    PGconn       *conn = fsstate->s.conn;
     StringInfoData buf;
     PGresult   *res;
 
@@ -3075,47 +3275,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    PGconn       *conn = fsstate->s.conn;
+    char        sql[64];
+
+    /* The connection should be vacant */
+    Assert(fsstate->s.connpriv->current_owner == NULL);
+
+    /*
+     * If this is the first call after Begin or ReScan, we need to create the
+     * cursor on the remote side.
+     */
+    if (!fsstate->cursor_exists)
+        create_cursor(node);
+
+    snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+             fsstate->fetch_size, fsstate->cursor_number);
+
+    if (!PQsendQuery(conn, sql))
+        pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+    fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the result of the FETCH previously sent by request_more_data.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     PGresult   *volatile res = NULL;
     MemoryContext oldcontext;
 
+    /* I should be the current connection owner */
+    Assert(fsstate->s.connpriv->current_owner == node);
+
     /*
      * We'll store the tuples in the batch_cxt.  First, flush the previous
-     * batch.
+     * batch if no tuples remain.
      */
-    fsstate->tuples = NULL;
-    MemoryContextReset(fsstate->batch_cxt);
+    if (fsstate->next_tuple >= fsstate->num_tuples)
+    {
+        fsstate->tuples = NULL;
+        fsstate->num_tuples = 0;
+        MemoryContextReset(fsstate->batch_cxt);
+    }
+    else if (fsstate->next_tuple > 0)
+    {
+        /* move the remaining tuples to the beginning of the store */
+        int n = 0;
+
+        while(fsstate->next_tuple < fsstate->num_tuples)
+            fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+        fsstate->num_tuples = n;
+    }
+
     oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
     /* PGresult must be released before leaving this function. */
     PG_TRY();
     {
-        PGconn       *conn = fsstate->conn;
+        PGconn       *conn = fsstate->s.conn;
         char        sql[64];
-        int            numrows;
+        int            addrows;
+        size_t        newsize;
         int            i;
 
         snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
                  fsstate->fetch_size, fsstate->cursor_number);
 
-        res = pgfdw_exec_query(conn, sql);
+        res = pgfdw_get_result(conn, sql);
         /* On error, report the original query, not the FETCH. */
         if (PQresultStatus(res) != PGRES_TUPLES_OK)
             pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
         /* Convert the data into HeapTuples */
-        numrows = PQntuples(res);
-        fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-        fsstate->num_tuples = numrows;
-        fsstate->next_tuple = 0;
+        addrows = PQntuples(res);
+        newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+        if (fsstate->tuples)
+            fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+        else
+            fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-        for (i = 0; i < numrows; i++)
+        for (i = 0; i < addrows; i++)
         {
             Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-            fsstate->tuples[i] =
+            fsstate->tuples[fsstate->num_tuples + i] =
                 make_tuple_from_result_row(res, i,
                                            fsstate->rel,
                                            fsstate->attinmeta,
@@ -3125,27 +3374,82 @@ fetch_more_data(ForeignScanState *node)
         }
 
         /* Update fetch_ct_2 */
-        if (fsstate->fetch_ct_2 < 2)
+        if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
             fsstate->fetch_ct_2++;
 
+        fsstate->next_tuple = 0;
+        fsstate->num_tuples += addrows;
+
         /* Must be EOF if we didn't get as many tuples as we asked for. */
-        fsstate->eof_reached = (numrows < fsstate->fetch_size);
+        fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
         PQclear(res);
         res = NULL;
     }
     PG_CATCH();
     {
+        fsstate->s.connpriv->current_owner = NULL;
         if (res)
             PQclear(res);
         PG_RE_THROW();
     }
     PG_END_TRY();
 
+    fsstate->s.connpriv->current_owner = NULL;
+
     MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that the caller can send its next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+    PgFdwConnpriv *connpriv = fdwstate->connpriv;
+    ForeignScanState *owner;
+
+    if (connpriv == NULL || connpriv->current_owner == NULL)
+        return;
+
+    /*
+     * Let the current connection owner read the result of the running query.
+     */
+    owner = connpriv->current_owner;
+    fetch_received_data(owner);
+
+    /* Clear the waiting list */
+    while (owner)
+    {
+        PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+        fsstate->last_waiter = NULL;
+        owner = fsstate->waiter;
+        fsstate->waiter = NULL;
+    }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+    if (owner)
+    {
+        PgFdwScanState *target_state = GetPgFdwScanState(owner);
+        PGconn *conn = target_state->s.conn;
+
+        while(PQisBusy(conn))
+            PQclear(PQgetResult(conn));
+        fsstate->s.connpriv->current_owner = NULL;
+        fsstate->async_waiting = false;
+    }
+}
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3229,7 +3533,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
     /* Construct name we'll use for the prepared statement. */
     snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-             GetPrepStmtNumber(fmstate->conn));
+             GetPrepStmtNumber(fmstate->s.conn));
     p_name = pstrdup(prep_name);
 
     /*
@@ -3239,12 +3543,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * the prepared statements we use in this module are simple enough that
      * the remote server will make the right choices.
      */
-    if (!PQsendPrepare(fmstate->conn,
+    if (!PQsendPrepare(fmstate->s.conn,
                        p_name,
                        fmstate->query,
                        0,
                        NULL))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3252,9 +3556,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
     PQclear(res);
 
     /* This action shows that the prepare has been done. */
@@ -3515,9 +3819,9 @@ execute_dml_stmt(ForeignScanState *node)
      * the desired result.  This allows us to avoid assuming that the remote
      * server has the same OIDs we do for the parameters' types.
      */
-    if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+    if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
                            NULL, values, NULL, NULL, 0))
-        pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+        pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3525,10 +3829,10 @@ execute_dml_stmt(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+    dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
     if (PQresultStatus(dmstate->result) !=
         (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+        pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
                            dmstate->query);
 
     /* Get the number of rows affected. */
@@ -5007,6 +5311,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
     /* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+    return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when this node is the connection owner; otherwise
+ * some other node on this connection owns it.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                                  void *caller_data, bool reinit)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+    /* If the caller didn't reinit, this event is already in the event set */
+    if (!reinit)
+        return true;
+
+    if (fsstate->s.connpriv->current_owner == node)
+    {
+        AddWaitEventToSet(wes,
+                          WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+                          NULL, caller_data);
+        return true;
+    }
+
+    return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index d37cc88..132367a 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
     UserMapping *user;            /* only set in use_remote_estimate mode */
 
     int            fetch_size;        /* fetch size for this remote table */
+    bool        allow_prefetch;    /* true to allow overlapped fetching */
 
     /*
      * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 2863549..9ba8135 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1614,25 +1614,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1668,12 +1668,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1732,8 +1732,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.9.2


Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
Hello. This is the new version of $Subject.

But this is not just a rebased version. While fixing serious
conflicts, I refactored the patch, and I believe it is now far
more readable than it was before.

# 0003 currently lacks the corresponding changes to postgres_fdw.out.

- Waiting-queue manipulation has been moved into new functions.
  The old code had a bug where the same node could be inserted
  into the queue more than once; that is fixed now (see the
  sketch after this list).

- postgresIterateForeignScan had a somewhat tricky structure that
  merged similar procedures, which made it hard to read. It is
  now far simpler and more straightforward.

- This still works only on Append/ForeignScan.
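
A sketch of the duplicate-safe enqueue (mine, not the patch
verbatim), using the Bitmapset pattern that nodeAppend.c relies
on; the helper name is hypothetical:

    /* Enqueue a subplan for async handling; safe to call twice. */
    static void
    append_enqueue_async(AppendState *node, int i)
    {
        /* bms_add_member() is a no-op for an existing member */
        node->as_pending_async =
            bms_add_member(node->as_pending_async, i);
    }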

> > The attached PoC patch theoretically has no impact on the normal
> > code paths and just brings gain in async cases.

I performed almost the same test as before but with:

- partitioned tables
  (there should be no difference from inheritance)

- added tests for fetch_size of 200 and 1000 in addition to 100.

  A fetch_size of 100 unreasonably magnifies the lag caused by
  context switching on a single underpowered box, as in tests D/F
  below. (They became about twice as fast after adding a small
  delay (1000 calls to clock_gettime()(*1)) just before
  epoll_wait, presumably because it then doesn't sleep...)

- The table size for test B is one tenth of the previous size,
  the same as one partition.

*1: The reason for using that function is that I first noticed
    the queries get much faster when merely prefixed with
    "explain analyze"...
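
    A sketch of that experimental delay as I understand it (the
    helper name and clock id are my assumptions):

        #include <time.h>

        /*
         * Burn a little CPU so that the epoll_wait() which
         * follows doesn't actually go to sleep.
         */
        static void
        spin_a_little(void)
        {
            struct timespec ts;
            int         i;

            for (i = 0; i < 1000; i++)
                clock_gettime(CLOCK_MONOTONIC, &ts);
        }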

>                            patched(ms)  unpatched(ms)   gain(%)
> A: simple table scan     :  3562.32      3444.81        -3.4
> B: local partitioning    :  1451.25      1604.38         9.5
> C: single remote table   :  8818.92      9297.76         5.1
> D: sharding (single con) :  5966.14      6646.73        10.2
> E: sharding (multi con)  :  1802.25      6515.49        72.3

fetch_size = 100
                            patched(ms)  unpatched(ms)   gain(%)
A: simple table scan     :   3033.48       2997.44        -1.2
B: local partitioning    :   1405.52       1426.66         1.5
C: single remote table   :   8335.50       8463.22         1.5
D: sharding (single con) :   6862.92       6820.97        -0.6
E: sharding (multi con)  :   2185.84       6733.63        67.5
F: partition (single con):   6818.13       6741.01        -1.1
G: partition (multi con) :   2150.58       6407.46        66.4

fetch_size = 200
                            patched(ms)  unpatched(ms)   gain(%)
A: simple table scan     :   
B: local partitioning    :   
C: single remote table   :   
D: sharding (single con) :   
E: sharding (multi con)  :   
F: partition (single con):   
G: partition (multi con) :   

fetch_size = 1000
                            patched(ms)  unpatched(ms)   gain(%)
A: simple table scan     :   3050.31       2980.29        -2.3
B: local partitioning    :   1401.34       1419.54         1.3
C: single remote table   :   8375.4        8445.27         0.8
D: sharding (single con) :   3935.97       4737.84        16.9
E: sharding (multi con)  :   1330.44       4752.87        72.0
F: partition (single con):   3997.63       4747.44        15.8
G: partition (multi con) :   1323.02       4807.72        72.5

Async append doesn't affect the non-async path at all, so B is
expected to show no degradation; the difference seems to be
within noise.

D and F show the gain when all foreign tables share one
connection, and E and G show the gain when every foreign table
has a dedicated connection.
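
(Here gain(%) is computed as (unpatched - patched) / unpatched * 100;
for example, row E at fetch_size = 100 gives
(6733.63 - 2185.84) / 6733.63 = 67.5%.)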

I will repost next week after filling in the blank portions of
the tables and completing the regression tests for the patch.
Sorry for the incomplete post.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 7ad4210dd20b6672367255492e2b1d95cd90b122 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

In certain cases a WaitEventSet needs to be released via a resource
owner. This change adds a resource owner member to WaitEventSet and
allows the creator of a WaitEventSet to specify one.
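
A minimal usage sketch of the new interface (mine, not part of the
patch; "sock" stands for some readable socket):

    WaitEventSet *wes;
    WaitEvent     event;

    /* tie the set's lifetime to the transaction's resource owner */
    wes = CreateWaitEventSet(TopTransactionContext,
                             TopTransactionResourceOwner, 2);
    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
    AddWaitEventToSet(wes, WL_SOCKET_READABLE, sock, NULL, NULL);

    /* an ERROR thrown in here no longer leaks the set */
    (void) WaitEventSetWait(wes, -1, &event, 1, WAIT_EVENT_ASYNC_WAIT);

    FreeWaitEventSet(wes);    /* also forgets the set in its owner */
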
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 67 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index a4f6d4deeb..890972b9b8 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
                 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
     AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
                       NULL, NULL);
     AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index e6706f7fb8..5457899f2d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
     int            nevents;        /* number of registered events */
     int            nevents_space;    /* maximum number of events in this set */
 
+    ResourceOwner    resowner;    /* Resource owner */
+
     /*
      * Array, of nevents_space length, storing the definition of events this
      * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
     int            ret = 0;
     int            rc;
     WaitEvent    event;
-    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
     if (wakeEvents & WL_TIMEOUT)
         Assert(timeout >= 0);
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
     WaitEventSet *set;
     char       *data;
     Size        sz = 0;
 
+    if (res)
+        ResourceOwnerEnlargeWESs(res);
+
     /*
      * Use MAXALIGN size/alignment to guarantee that later uses of memory are
      * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
     StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+    /* Register this wait event set if requested */
+    set->resowner = res;
+    if (res)
+        ResourceOwnerRememberWES(set->resowner, set);
+
     return set;
 }
 
@@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set)
     }
 #endif
 
+    if (set->resowner != NULL)
+        ResourceOwnerForgetWES(set->resowner, set);
+
     pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index ef1d5baf01..30edc8e83a 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -69,7 +69,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     {
         WaitEventSet *new_event_set;
 
-        new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+        new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
         AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
                           MyLatch, NULL);
         AddWaitEventToSet(new_event_set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index bce021e100..802b79a660 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -126,6 +126,7 @@ typedef struct ResourceOwnerData
     ResourceArray filearr;        /* open temporary files */
     ResourceArray dsmarr;        /* dynamic shmem segments */
     ResourceArray jitarr;        /* JIT contexts */
+    ResourceArray wesarr;        /* wait event sets */
 
     /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
     int            nlocks;            /* number of owned locks */
@@ -171,6 +172,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -440,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
     ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
     ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
     ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+    ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
     return owner;
 }
@@ -549,6 +552,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 
             jit_release_context(context);
         }
+
+        /* Ditto for wait event sets */
+        while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+        {
+            WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+            if (isCommit)
+                PrintWESLeakWarning(event);
+            FreeWaitEventSet(event);
+        }
     }
     else if (phase == RESOURCE_RELEASE_LOCKS)
     {
@@ -697,6 +710,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     Assert(owner->filearr.nitems == 0);
     Assert(owner->dsmarr.nitems == 0);
     Assert(owner->jitarr.nitems == 0);
+    Assert(owner->wesarr.nitems == 0);
     Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
     /*
@@ -724,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     ResourceArrayFree(&(owner->filearr));
     ResourceArrayFree(&(owner->dsmarr));
     ResourceArrayFree(&(owner->jitarr));
+    ResourceArrayFree(&(owner->wesarr));
 
     pfree(owner);
 }
@@ -1301,3 +1316,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
         elog(ERROR, "JIT context %p is not owned by resource owner %s",
              DatumGetPointer(handle), owner->name);
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.  This is separate from actually inserting
+ * an entry because if we run out of memory, it's critical to do so *before*
+ * acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+    ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+    ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identifier of a wait event
+     * set, so use its pointer instead.
+     */
+    if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+        elog(ERROR, "wait event set %p is not owned by resource owner %s",
+             events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identifier of a wait event
+     * set, so use its pointer instead.
+     */
+    elog(WARNING, "wait event set leak: %p still referenced",
+         events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index a4bcb48874..838845af01 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+                                        ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
                   Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a6e8eb71ab..3c06e4c3f8 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
 extern void ResourceOwnerForgetJIT(ResourceOwner owner,
                        Datum handle);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+                         WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+                       WaitEventSet *);
+
 #endif                            /* RESOWNER_PRIVATE_H */
-- 
2.16.3

From 0b3b692e677f7fd19f618582412acf9d12231bb2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:23:51 +0900
Subject: [PATCH 2/3] core side modification

---
 contrib/postgres_fdw/expected/postgres_fdw.out | 100 +++++-----
 src/backend/commands/explain.c                 |  17 ++
 src/backend/executor/Makefile                  |   2 +-
 src/backend/executor/execAsync.c               | 145 ++++++++++++++
 src/backend/executor/nodeAppend.c              | 262 +++++++++++++++++++++----
 src/backend/executor/nodeForeignscan.c         |  22 ++-
 src/backend/nodes/copyfuncs.c                  |   2 +
 src/backend/nodes/outfuncs.c                   |   2 +
 src/backend/nodes/readfuncs.c                  |   2 +
 src/backend/optimizer/plan/createplan.c        |  68 ++++++-
 src/backend/postmaster/pgstat.c                |   3 +
 src/backend/utils/adt/ruleutils.c              |   8 +-
 src/include/executor/execAsync.h               |  23 +++
 src/include/executor/executor.h                |   1 +
 src/include/executor/nodeForeignscan.h         |   3 +
 src/include/foreign/fdwapi.h                   |  11 ++
 src/include/nodes/execnodes.h                  |  18 +-
 src/include/nodes/plannodes.h                  |   7 +
 src/include/pgstat.h                           |   3 +-
 19 files changed, 603 insertions(+), 96 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index bb6b1a8fdf..248aa73c0b 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6968,12 +6968,13 @@ select * from bar where f1 in (select f1 from foo) for update;
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(29 rows)
 
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
@@ -7006,12 +7007,13 @@ select * from bar where f1 in (select f1 from foo) for share;
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(29 rows)
 
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
@@ -7043,9 +7045,8 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
    ->  Hash Join
@@ -7061,12 +7062,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(41 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
 select tableoid::regclass, * from bar order by 1,2;
@@ -7096,14 +7098,11 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
-               ->  Foreign Scan on public.foo2
+               Async subplans: 2 
+               ->  Async Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-               ->  Foreign Scan on public.foo2 foo2_1
+               ->  Async Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
          ->  Hash
@@ -7123,17 +7122,18 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
-                     ->  Foreign Scan on public.foo2
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-                     ->  Foreign Scan on public.foo2 foo2_1
+                     ->  Async Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
+(47 rows)
 
 update bar set f2 = f2 + 100
 from
@@ -8155,11 +8155,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
  Sort
    Sort Key: t1.a, t3.c
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: ((public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)) INNER JOIN (public.ftprt1_p1 t3)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: ((public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)) INNER JOIN (public.ftprt1_p2 t3)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE
t1.a% 25 =0 ORDER BY 1,2,3; 
   a  |  b  |  c   
@@ -8178,9 +8179,10 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
  Sort
    Sort Key: t1.a, ftprt2_p1.b, ftprt2_p1.c
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 1 
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p1 t1) LEFT JOIN (public.ftprt2_p1 fprt2)
-(5 rows)
+(6 rows)
 
 SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) t2 ON (t1.a = t2.b and t1.b = t2.a)
WHEREt1.a < 10 ORDER BY 1,2,3;
 
  a | b |  c   
@@ -8200,11 +8202,12 @@ SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE
  Sort
    Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)
-(7 rows)
+(8 rows)
 
 SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a % 25 =0 ORDER BY 1,2;
        t1       |       t2       
@@ -8223,11 +8226,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
  Sort
    Sort Key: t1.a, t1.b
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE
t1.a%25= 0 ORDER BY 1,2;
 
   a  |  b  
@@ -8309,10 +8313,11 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
          Group Key: fpagg_tab_p1.a
          Filter: (avg(fpagg_tab_p1.b) < '22'::numeric)
          ->  Append
-               ->  Foreign Scan on fpagg_tab_p1
-               ->  Foreign Scan on fpagg_tab_p2
-               ->  Foreign Scan on fpagg_tab_p3
-(9 rows)
+               Async subplans: 3 
+               ->  Async Foreign Scan on fpagg_tab_p1
+               ->  Async Foreign Scan on fpagg_tab_p2
+               ->  Async Foreign Scan on fpagg_tab_p3
+(10 rows)
 
 -- Plan with partitionwise aggregates is enabled
 SET enable_partitionwise_aggregate TO true;
@@ -8323,13 +8328,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
  Sort
    Sort Key: fpagg_tab_p1.a
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 3 
+         ->  Async Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p1 pagg_tab)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p2 pagg_tab)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p3 pagg_tab)
-(9 rows)
+(10 rows)
 
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | sum  | min | count 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 73d94b7235..09c5327cb4 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -83,6 +83,7 @@ static void show_sort_keys(SortState *sortstate, List *ancestors,
                ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                        ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
               ExplainState *es);
 static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1168,6 +1169,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
         }
         if (plan->parallel_aware)
             appendStringInfoString(es->str, "Parallel ");
+        if (plan->async_capable)
+            appendStringInfoString(es->str, "Async ");
         appendStringInfoString(es->str, pname);
         es->indent++;
     }
@@ -1690,6 +1693,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
         case T_Hash:
             show_hash_info(castNode(HashState, planstate), es);
             break;
+
+        case T_Append:
+            show_append_info(castNode(AppendState, planstate), es);
+            break;
+
         default:
             break;
     }
@@ -2027,6 +2035,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                          ancestors, es);
 }
 
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+    Append *plan = (Append *) astate->ps.plan;
+
+    if (plan->nasyncplans > 0)
+        ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
 /*
  * Show the grouping keys for an Agg node.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..8ad2adfe1c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execPartition.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..db477e2cf6
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,145 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *      Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *      src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+void ExecAsyncSetState(PlanState *pstate, AsyncState status)
+{
+    pstate->asyncstate = status;
+}
+
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                       void *data, bool reinit)
+{
+    switch (nodeTag(node))
+    {
+    case T_ForeignScanState:
+        return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+                                             wes, data, reinit);
+        break;
+    default:
+        elog(ERROR, "unrecognized node type: %d",
+             (int) nodeTag(node));
+    }
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+    int **p_refind;
+    int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+    /* arg is the address of the variable refind in ExecAsyncEventWait */
+    ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+    *mcbarg->p_refind = NULL;
+    *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+    static int *refind = NULL;
+    static int refindsize = 0;
+    WaitEventSet *wes;
+    WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+    int noccurred = 0;
+    Bitmapset *fired_events = NULL;
+    int i;
+    int n;
+
+    n = bms_num_members(waitnodes);
+    wes = CreateWaitEventSet(TopTransactionContext,
+                             TopTransactionResourceOwner, n);
+    if (refindsize < n)
+    {
+        if (refindsize == 0)
+            refindsize = EVENT_BUFFER_SIZE; /* XXX */
+        while (refindsize < n)
+            refindsize *= 2;
+        if (refind)
+            refind = (int *) repalloc(refind, refindsize * sizeof(int));
+        else
+        {
+            static ExecAsync_mcbarg mcb_arg =
+                { &refind, &refindsize };
+            static MemoryContextCallback mcb =
+                { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+            MemoryContext oldctxt =
+                MemoryContextSwitchTo(TopTransactionContext);
+
+            /*
+             * refind points to a memory block in
+             * TopTransactionContext. Register a callback to reset it.
+             */
+            MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+            refind = (int *) palloc(refindsize * sizeof(int));
+            MemoryContextSwitchTo(oldctxt);
+        }
+    }
+
+    n = 0;
+    for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+         i = bms_next_member(waitnodes, i))
+    {
+        refind[i] = i;
+        if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+            n++;
+    }
+
+    if (n == 0)
+    {
+        FreeWaitEventSet(wes);
+        return NULL;
+    }
+
+    noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+                                 EVENT_BUFFER_SIZE,
+                                 WAIT_EVENT_ASYNC_WAIT);
+    FreeWaitEventSet(wes);
+    if (noccurred == 0)
+        return NULL;
+
+    for (i = 0 ; i < noccurred ; i++)
+    {
+        WaitEvent *w = &occurred_event[i];
+
+        if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+        {
+            int n = *(int*)w->user_data;
+
+            fired_events = bms_add_member(fired_events, n);
+        }
+    }
+
+    return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 6bc3e470bf..ed8612dd37 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
 #include "executor/execdebug.h"
 #include "executor/execPartition.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -81,6 +82,7 @@ struct ParallelAppendState
 #define NO_MATCHING_SUBPLANS        -2
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -104,13 +106,14 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     PlanState **appendplanstates;
     Bitmapset  *validsubplans;
     int            nplans;
+    int            nasyncplans;
     int            firstvalid;
     int            i,
                 j;
     ListCell   *lc;
 
     /* check for unsupported flags */
-    Assert(!(eflags & EXEC_FLAG_MARK));
+    Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
     /*
      * Lock the non-leaf tables in the partition tree controlled by this node.
@@ -123,10 +126,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
      */
     appendstate->ps.plan = (Plan *) node;
     appendstate->ps.state = estate;
-    appendstate->ps.ExecProcNode = ExecAppend;
+
+    /* choose appropriate version of Exec function */
+    if (node->nasyncplans == 0)
+        appendstate->ps.ExecProcNode = ExecAppend;
+    else
+        appendstate->ps.ExecProcNode = ExecAppendAsync;
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+    appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 
     /* If run-time partition pruning is enabled, then set that up now */
     if (node->part_prune_infos != NIL)
@@ -159,7 +167,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
              */
             if (bms_is_empty(validsubplans))
             {
-                appendstate->as_whichplan = NO_MATCHING_SUBPLANS;
+                appendstate->as_whichsyncplan = NO_MATCHING_SUBPLANS;
 
                 /* Mark the first as valid so that it's initialized below */
                 validsubplans = bms_make_singleton(0);
@@ -213,11 +221,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
      */
     j = i = 0;
     firstvalid = nplans;
+    nasyncplans = 0;
     foreach(lc, node->appendplans)
     {
         if (bms_is_member(i, validsubplans))
         {
             Plan       *initNode = (Plan *) lfirst(lc);
+            int            sub_eflags = eflags;
+
+            /* Let async-capable subplans run asynchronously */
+            if (i < node->nasyncplans)
+            {
+                sub_eflags |= EXEC_FLAG_ASYNC;
+                nasyncplans++;
+            }
 
             /*
              * Record the lowest appendplans index which is a valid partial
@@ -226,7 +243,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
             if (i >= node->first_partial_plan && j < firstvalid)
                 firstvalid = j;
 
-            appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+            appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
         }
         i++;
     }
@@ -235,6 +252,21 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     appendstate->appendplans = appendplanstates;
     appendstate->as_nplans = nplans;
 
+    /* fill in async stuff */
+    appendstate->as_nasyncplans = nasyncplans;
+    appendstate->as_syncdone = (nasyncplans == nplans);
+
+    if (appendstate->as_nasyncplans)
+    {
+        appendstate->as_asyncresult = (TupleTableSlot **)
+            palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+        /* initially, every async subplan needs a request */
+        for (i = 0; i < appendstate->as_nasyncplans; ++i)
+            appendstate->as_needrequest =
+                bms_add_member(appendstate->as_needrequest, i);
+    }
+
     /*
      * Miscellaneous initialization
      */
@@ -258,21 +290,23 @@ ExecAppend(PlanState *pstate)
 {
     AppendState *node = castNode(AppendState, pstate);
 
-    if (node->as_whichplan < 0)
+    if (node->as_whichsyncplan < 0)
     {
         /*
          * If no subplan has been chosen, we must choose one before
          * proceeding.
          */
-        if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+        if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
             !node->choose_next_subplan(node))
             return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 
         /* Nothing to do if there are no matching subplans */
-        else if (node->as_whichplan == NO_MATCHING_SUBPLANS)
+        else if (node->as_whichsyncplan == NO_MATCHING_SUBPLANS)
             return ExecClearTuple(node->ps.ps_ResultTupleSlot);
     }
 
+    Assert(node->as_nasyncplans == 0);
+
     for (;;)
     {
         PlanState  *subnode;
@@ -283,8 +317,9 @@ ExecAppend(PlanState *pstate)
         /*
          * figure out which subplan we are currently processing
          */
-        Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-        subnode = node->appendplans[node->as_whichplan];
+        Assert(node->as_whichsyncplan >= 0 &&
+               node->as_whichsyncplan < node->as_nplans);
+        subnode = node->appendplans[node->as_whichsyncplan];
 
         /*
          * get a tuple from the subplan
@@ -307,6 +342,156 @@ ExecAppend(PlanState *pstate)
     }
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+    AppendState *node = castNode(AppendState, pstate);
+    Bitmapset *needrequest;
+    int    i;
+
+    Assert(node->as_nasyncplans > 0);
+
+    if (node->as_nasyncresult > 0)
+    {
+        --node->as_nasyncresult;
+        return node->as_asyncresult[node->as_nasyncresult];
+    }
+
+    needrequest = node->as_needrequest;
+    node->as_needrequest = NULL;
+    while ((i = bms_first_member(needrequest)) >= 0)
+    {
+        TupleTableSlot *slot;
+        PlanState *subnode = node->appendplans[i];
+
+        slot = ExecProcNode(subnode);
+        if (subnode->asyncstate == AS_AVAILABLE)
+        {
+            if (!TupIsNull(slot))
+            {
+                node->as_asyncresult[node->as_nasyncresult++] = slot;
+                node->as_needrequest = bms_add_member(node->as_needrequest, i);
+            }
+        }
+        else
+            node->as_pending_async = bms_add_member(node->as_pending_async, i);
+    }
+    bms_free(needrequest);
+
+    for (;;)
+    {
+        TupleTableSlot *result;
+
+        /* return now if a result is available */
+        if (node->as_nasyncresult > 0)
+        {
+            --node->as_nasyncresult;
+            return node->as_asyncresult[node->as_nasyncresult];
+        }
+
+        while (!bms_is_empty(node->as_pending_async))
+        {
+            long timeout = node->as_syncdone ? -1 : 0;
+            Bitmapset *fired;
+            int i;
+
+            fired = ExecAsyncEventWait(node->appendplans,
+                                       node->as_pending_async,
+                                       timeout);
+            Assert(!node->as_syncdone || !bms_is_empty(fired));
+
+            while ((i = bms_first_member(fired)) >= 0)
+            {
+                TupleTableSlot *slot;
+                PlanState *subnode = node->appendplans[i];
+                slot = ExecProcNode(subnode);
+                if (subnode->asyncstate == AS_AVAILABLE)
+                {
+                    if (!TupIsNull(slot))
+                    {
+                        node->as_asyncresult[node->as_nasyncresult++] = slot;
+                        node->as_needrequest =
+                            bms_add_member(node->as_needrequest, i);
+                    }
+                    node->as_pending_async =
+                        bms_del_member(node->as_pending_async, i);
+                }
+            }
+            bms_free(fired);
+
+            /* return now if a result is available */
+            if (node->as_nasyncresult > 0)
+            {
+                --node->as_nasyncresult;
+                return node->as_asyncresult[node->as_nasyncresult];
+            }
+
+            if (!node->as_syncdone)
+                break;
+        }
+
+        /*
+         * If there is no asynchronous activity still pending and the
+         * synchronous activity is also complete, we're totally done scanning
+         * this node.  Otherwise, we're done with the asynchronous stuff but
+         * must continue scanning the synchronous children.
+         */
+        if (node->as_syncdone)
+        {
+            Assert(bms_is_empty(node->as_pending_async));
+            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+        }
+
+        /*
+         * Get a tuple from the next synchronous subplan.
+         */
+
+        if (node->as_whichsyncplan < 0)
+        {
+            /*
+             * If no subplan has been chosen, we must choose one before
+             * proceeding.
+             */
+            if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
+                !node->choose_next_subplan(node))
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+
+            /* Nothing to do if there are no matching subplans */
+            else if (node->as_whichsyncplan == NO_MATCHING_SUBPLANS)
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+        }
+
+        result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+        if (!TupIsNull(result))
+        {
+            /*
+             * If the subplan gave us something then return it as-is. We do
+             * NOT make use of the result slot that was set up in
+             * ExecInitAppend; there's no need for it.
+             */
+            return result;
+        }
+
+        /*
+         * Go on to the "next" subplan in the appropriate direction. If no
+         * more subplans, return the empty slot set up for us by
+         * ExecInitAppend, unless there are async plans we have yet to finish.
+         */
+        if (!node->choose_next_subplan(node))
+        {
+            node->as_syncdone = true;
+            if (bms_is_empty(node->as_pending_async))
+            {
+                Assert(bms_is_empty(node->as_needrequest));
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+            }
+        }
+
+        /* Else loop back and try to get a tuple from the new subplan */
+    }
+}
+
 /* ----------------------------------------------------------------
  *        ExecEndAppend
  *
@@ -353,6 +538,15 @@ ExecReScanAppend(AppendState *node)
         node->as_valid_subplans = NULL;
     }
 
+    /* Reset async state. */
+    for (i = 0; i < node->as_nasyncplans; ++i)
+    {
+        ExecShutdownNode(node->appendplans[i]);
+        node->as_needrequest = bms_add_member(node->as_needrequest, i);
+    }
+    node->as_nasyncresult = 0;
+    node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
     for (i = 0; i < node->as_nplans; i++)
     {
         PlanState  *subnode = node->appendplans[i];
@@ -373,7 +567,7 @@ ExecReScanAppend(AppendState *node)
     }
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    node->as_whichplan = INVALID_SUBPLAN_INDEX;
+    node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 }
 
 /* ----------------------------------------------------------------
@@ -461,7 +655,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-    int            whichplan = node->as_whichplan;
+    int            whichplan = node->as_whichsyncplan;
     int            nextplan;
 
     /* We should never be called when there are no subplans */
@@ -494,7 +688,7 @@ choose_next_subplan_locally(AppendState *node)
     if (nextplan < 0)
         return false;
 
-    node->as_whichplan = nextplan;
+    node->as_whichsyncplan = nextplan;
 
     return true;
 }
@@ -516,19 +710,19 @@ choose_next_subplan_for_leader(AppendState *node)
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
     /* We should never be called when there are no subplans */
-    Assert(node->as_whichplan != NO_MATCHING_SUBPLANS);
+    Assert(node->as_whichsyncplan != NO_MATCHING_SUBPLANS);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
     {
         /* Mark just-completed subplan as finished. */
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
     }
     else
     {
         /* Start with last subplan. */
-        node->as_whichplan = node->as_nplans - 1;
+        node->as_whichsyncplan = node->as_nplans - 1;
 
         /*
          * If we've yet to determine the valid subplans for these parameters
@@ -549,12 +743,12 @@ choose_next_subplan_for_leader(AppendState *node)
     }
 
     /* Loop until we find a subplan to execute. */
-    while (pstate->pa_finished[node->as_whichplan])
+    while (pstate->pa_finished[node->as_whichsyncplan])
     {
-        if (node->as_whichplan == 0)
+        if (node->as_whichsyncplan == 0)
         {
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-            node->as_whichplan = INVALID_SUBPLAN_INDEX;
+            node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
             LWLockRelease(&pstate->pa_lock);
             return false;
         }
@@ -563,12 +757,12 @@ choose_next_subplan_for_leader(AppendState *node)
          * We needn't pay attention to as_valid_subplans here as all invalid
          * plans have been marked as finished.
          */
-        node->as_whichplan--;
+        node->as_whichsyncplan--;
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -597,13 +791,13 @@ choose_next_subplan_for_worker(AppendState *node)
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
     /* We should never be called when there are no subplans */
-    Assert(node->as_whichplan != NO_MATCHING_SUBPLANS);
+    Assert(node->as_whichsyncplan != NO_MATCHING_SUBPLANS);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
     /* Mark just-completed subplan as finished. */
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     /*
      * If we've yet to determine the valid subplans for these parameters then
@@ -625,7 +819,7 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Save the plan from which we are starting the search. */
-    node->as_whichplan = pstate->pa_next_plan;
+    node->as_whichsyncplan = pstate->pa_next_plan;
 
     /* Loop until we find a valid subplan to execute. */
     while (pstate->pa_finished[pstate->pa_next_plan])
@@ -639,7 +833,7 @@ choose_next_subplan_for_worker(AppendState *node)
             /* Advance to the next valid plan. */
             pstate->pa_next_plan = nextplan;
         }
-        else if (node->as_whichplan > node->as_first_partial_plan)
+        else if (node->as_whichsyncplan > node->as_first_partial_plan)
         {
             /*
              * Try looping back to the first valid partial plan, if there is
@@ -648,7 +842,7 @@ choose_next_subplan_for_worker(AppendState *node)
             nextplan = bms_next_member(node->as_valid_subplans,
                                        node->as_first_partial_plan - 1);
             pstate->pa_next_plan =
-                nextplan < 0 ? node->as_whichplan : nextplan;
+                nextplan < 0 ? node->as_whichsyncplan : nextplan;
         }
         else
         {
@@ -656,10 +850,10 @@ choose_next_subplan_for_worker(AppendState *node)
              * At last plan, and either there are no partial plans or we've
              * tried them all.  Arrange to bail out.
              */
-            pstate->pa_next_plan = node->as_whichplan;
+            pstate->pa_next_plan = node->as_whichsyncplan;
         }
 
-        if (pstate->pa_next_plan == node->as_whichplan)
+        if (pstate->pa_next_plan == node->as_whichsyncplan)
         {
             /* We've tried everything! */
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -669,7 +863,7 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Pick the plan we found, and advance pa_next_plan one more time. */
-    node->as_whichplan = pstate->pa_next_plan;
+    node->as_whichsyncplan = pstate->pa_next_plan;
     pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
                                            pstate->pa_next_plan);
 
@@ -696,8 +890,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index a2a28b7ec2..915deb7080 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate)
                     (ExecScanRecheckMtd) ForeignRecheck);
 }
 
-
 /* ----------------------------------------------------------------
  *        ExecInitForeignScan
  * ----------------------------------------------------------------
@@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
     scanstate->ss.ps.plan = (Plan *) node;
     scanstate->ss.ps.state = estate;
     scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+    scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+    if ((eflags & EXEC_FLAG_ASYNC) != 0)
+        scanstate->fs_async = true;
 
     /*
      * Miscellaneous initialization
@@ -387,3 +390,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
     if (fdwroutine->ShutdownForeignScan)
         fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *        ExecAsyncForeignScanConfigureWait
+ *
+ *        In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                              void *caller_data, bool reinit)
+{
+    FdwRoutine *fdwroutine = node->fdwroutine;
+
+    Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+    return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+                                                 caller_data, reinit);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7c045a7afe..8304dd5b17 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -246,6 +246,8 @@ _copyAppend(const Append *from)
     COPY_NODE_FIELD(appendplans);
     COPY_SCALAR_FIELD(first_partial_plan);
     COPY_NODE_FIELD(part_prune_infos);
+    COPY_SCALAR_FIELD(nasyncplans);
+    COPY_SCALAR_FIELD(referent);
 
     return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1da9d7ed15..ed655f4ccb 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -403,6 +403,8 @@ _outAppend(StringInfo str, const Append *node)
     WRITE_NODE_FIELD(appendplans);
     WRITE_INT_FIELD(first_partial_plan);
     WRITE_NODE_FIELD(part_prune_infos);
+    WRITE_INT_FIELD(nasyncplans);
+    WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 2826cec2f8..fb4ae251de 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1652,6 +1652,8 @@ _readAppend(void)
     READ_NODE_FIELD(appendplans);
     READ_INT_FIELD(first_partial_plan);
     READ_NODE_FIELD(part_prune_infos);
+    READ_INT_FIELD(nasyncplans);
+    READ_INT_FIELD(referent);
 
     READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0317763f43..eda3420d02 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -211,7 +211,9 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
                    Index scanrelid, int wtParam);
 static Append *make_append(List *appendplans, int first_partial_plan,
-            List *tlist, List *partitioned_rels, List *partpruneinfos);
+                           int nasyncplans, int referent,
+                           List *tlist,
+                           List *partitioned_rels, List *partpruneinfos);
 static RecursiveUnion *make_recursive_union(List *tlist,
                      Plan *lefttree,
                      Plan *righttree,
@@ -294,6 +296,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
                  List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
                          GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -1036,10 +1039,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
     Append       *plan;
     List       *tlist = build_path_tlist(root, &best_path->path);
-    List       *subplans = NIL;
+    List       *asyncplans = NIL;
+    List       *syncplans = NIL;
     ListCell   *subpaths;
     RelOptInfo *rel = best_path->path.parent;
     List       *partpruneinfos = NIL;
+    int            nasyncplans = 0;
+    bool        first = true;
+    bool        referent_is_sync = true;
 
     /*
      * The subpaths list could be empty, if every child was proven empty by
@@ -1074,7 +1081,22 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
         /* Must insist that all children return the same tlist */
         subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-        subplans = lappend(subplans, subplan);
+        /*
+         * Classify as async-capable or not.  If we have decided to run the
+         * children in parallel, we cannot run any of them asynchronously.
+         */
+        if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+        {
+            subplan->async_capable = true;
+            asyncplans = lappend(asyncplans, subplan);
+            ++nasyncplans;
+            if (first)
+                referent_is_sync = false;
+        }
+        else
+            syncplans = lappend(syncplans, subplan);
+
+        first = false;
     }
 
     if (enable_partition_pruning &&
@@ -1117,9 +1139,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
      * parent-rel Vars it'll be asked to emit.
      */
 
-    plan = make_append(subplans, best_path->first_partial_path,
-                       tlist, best_path->partitioned_rels,
-                       partpruneinfos);
+    plan = make_append(list_concat(asyncplans, syncplans),
+                       best_path->first_partial_path, nasyncplans,
+                       referent_is_sync ? nasyncplans : 0, tlist,
+                       best_path->partitioned_rels, partpruneinfos);
 
     copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5414,9 +5437,9 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int first_partial_plan,
-            List *tlist, List *partitioned_rels,
-            List *partpruneinfos)
+make_append(List *appendplans, int first_partial_plan, int nasyncplans,
+            int referent, List *tlist,
+            List *partitioned_rels, List *partpruneinfos)
 {
     Append       *node = makeNode(Append);
     Plan       *plan = &node->plan;
@@ -5429,6 +5452,9 @@ make_append(List *appendplans, int first_partial_plan,
     node->appendplans = appendplans;
     node->first_partial_plan = first_partial_plan;
     node->part_prune_infos = partpruneinfos;
+    node->nasyncplans = nasyncplans;
+    node->referent = referent;
+
     return node;
 }
 
@@ -6773,3 +6799,27 @@ is_projection_capable_plan(Plan *plan)
     }
     return true;
 }
+
+/*
+ * is_async_capable_path
+ *        Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+    switch (nodeTag(path))
+    {
+        case T_ForeignPath:
+            {
+                FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+                Assert(fdwroutine != NULL);
+                if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+                    fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+                    return true;
+            }
+        default:
+            break;
+    }
+    return false;
+}
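
To make the new make_append() arguments concrete: given three child paths
of which only the second and third are async-capable, the code above
produces

    subpaths     = [sync1, async1, async2]
    appendplans  = [async1, async2, sync1]   /* async plans moved up front */
    nasyncplans  = 2
    referent     = 2    /* the original first child now sits at index 2 */

whereas if the first child is itself async-capable, referent stays 0.
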
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 084573e77c..7aef97ca97 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3683,6 +3683,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_SYNC_REP:
             event_name = "SyncRep";
             break;
+        case WAIT_EVENT_ASYNC_WAIT:
+            event_name = "AsyncExecWait";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 065238b0fe..fe202cbfea 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4513,7 +4513,7 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
     dpns->planstate = ps;
 
     /*
-     * We special-case Append and MergeAppend to pretend that the first child
+     * We special-case Append and MergeAppend to pretend that a specific child
      * plan is the OUTER referent; we have to interpret OUTER Vars in their
      * tlists according to one of the children, and the first one is the most
      * natural choice.  Likewise special-case ModifyTable to pretend that the
@@ -4521,7 +4521,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
      * lists containing references to non-target relations.
      */
     if (IsA(ps, AppendState))
-        dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+    {
+        AppendState *aps = (AppendState *) ps;
+        Append *app = (Append *) ps->plan;
+        dpns->outer_planstate = aps->appendplans[app->referent];
+    }
     else if (IsA(ps, MergeAppendState))
         dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
     else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..5fd67d9004
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *        Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *        src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern void ExecAsyncSetState(PlanState *pstate, AsyncState status);
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                                   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+                                     long timeout);
+#endif   /* EXECASYNC_H */
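
Going only by these declarations, a requestor such as Append presumably
drives the interface roughly as in this sketch (the actual loop lives in
execAsync.c and nodeAppend.c earlier in the patch):

    Bitmapset  *fired;

    /* wait until at least one pending async child has something for us */
    fired = ExecAsyncEventWait(appendstate->appendplans,      /* PlanState array */
                               appendstate->as_pending_async, /* nodes to wait on */
                               -1);     /* timeout; -1 assumed to mean "no timeout" */

    /* 'fired' then tells us which children are worth re-invoking */
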
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index a7ea3c7d10..8e9d87669f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -63,6 +63,7 @@
 #define EXEC_FLAG_WITH_OIDS        0x0020    /* force OIDs in returned tuples */
 #define EXEC_FLAG_WITHOUT_OIDS    0x0040    /* force no OIDs in returned tuples */
 #define EXEC_FLAG_WITH_NO_DATA    0x0080    /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC            0x0100    /* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index ccb66be733..67abf8e52e 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
                                 ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+                                          WaitEventSet *wes,
+                                          void *caller_data, bool reinit);
 
 #endif                            /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index c14eb546c6..c00e9621fb 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -168,6 +168,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
                                                             List *fdw_private,
                                                             RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+                                                    WaitEventSet *wes,
+                                                    void *caller_data,
+                                                    bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -189,6 +194,7 @@ typedef struct FdwRoutine
     GetForeignPlan_function GetForeignPlan;
     BeginForeignScan_function BeginForeignScan;
     IterateForeignScan_function IterateForeignScan;
+    IterateForeignScan_function IterateForeignScanAsync;
     ReScanForeignScan_function ReScanForeignScan;
     EndForeignScan_function EndForeignScan;
 
@@ -241,6 +247,11 @@ typedef struct FdwRoutine
     InitializeDSMForeignScan_function InitializeDSMForeignScan;
     ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
     InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+    /* Support functions for asynchronous execution */
+    IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+    ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
     ShutdownForeignScan_function ShutdownForeignScan;
 
     /* Support functions for path reparameterization. */
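
An FDW opts into async execution by filling the two new members from its
handler, as the third patch does for postgres_fdw below; a trivially
permissive capability check could look like this (hypothetical names):

    static bool
    myIsForeignPathAsyncCapable(ForeignPath *path)
    {
        return true;        /* claim every foreign path is async-capable */
    }

    /* in the handler: */
    routine->IsForeignPathAsyncCapable = myIsForeignPathAsyncCapable;
    routine->ForeignAsyncConfigureWait = myForeignAsyncConfigureWait;
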
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index da7f52cab0..56bfe3f442 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -905,6 +905,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+    AS_AVAILABLE,
+    AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
     NodeTag        type;
@@ -953,6 +959,9 @@ typedef struct PlanState
      * descriptor, without encoding knowledge about all executor nodes.
      */
     TupleDesc    scandesc;
+
+    AsyncState    asyncstate;
+    int32        padding;            /* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1087,14 +1096,20 @@ struct AppendState
     PlanState    ps;                /* its first field is NodeTag */
     PlanState **appendplans;    /* array of PlanStates for my inputs */
     int            as_nplans;
-    int            as_whichplan;
+    int            as_whichsyncplan; /* which sync plan is being executed  */
     int            as_first_partial_plan;    /* Index of 'appendplans' containing
                                          * the first partial plan */
+    int            as_nasyncplans;    /* # of async-capable children */
     ParallelAppendState *as_pstate; /* parallel coordination info */
     Size        pstate_len;        /* size of parallel coordination info */
     struct PartitionPruneState *as_prune_state;
     Bitmapset  *as_valid_subplans;
     bool        (*choose_next_subplan) (AppendState *);
+    bool        as_syncdone;    /* all synchronous plans done? */
+    Bitmapset  *as_needrequest;    /* async plans needing a new request */
+    Bitmapset  *as_pending_async;    /* pending async plans */
+    TupleTableSlot **as_asyncresult;    /* unreturned results of async plans */
+    int            as_nasyncresult;    /* # of valid entries in as_asyncresult */
 };
 
 /* ----------------
@@ -1643,6 +1658,7 @@ typedef struct ForeignScanState
     Size        pscan_len;        /* size of parallel coordination information */
     /* use struct pointer to avoid including fdwapi.h here */
     struct FdwRoutine *fdwroutine;
+    bool        fs_async;
     void       *fdw_state;        /* foreign-data wrapper can keep state here */
 } ForeignScanState;
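
The protocol implied by asyncstate: once a child runs asynchronously, an
empty result slot by itself is ambiguous, so the parent must consult
asyncstate to tell EOF from not-ready-yet.  A sketch of the caller side
(shape assumed, not taken verbatim from the patch):

    slot = ExecProcNode(child);
    if (TupIsNull(slot))
    {
        if (child->asyncstate == AS_AVAILABLE)
            ;       /* child reached EOF; drop it from the pending set */
        else
            ;       /* AS_WAITING: wait on its event and retry later */
    }
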
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f2dda82e66..8a64c037c9 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -139,6 +139,11 @@ typedef struct Plan
     bool        parallel_aware; /* engage parallel-aware logic? */
     bool        parallel_safe;    /* OK to use as part of parallel plan? */
 
+    /*
+     * information needed for asynchronous execution
+     */
+    bool        async_capable;  /* engage asynchronous execution logic? */
+
     /*
      * Common structural data for all Plan types.
      */
@@ -262,6 +267,8 @@ typedef struct Append
      * Mapping details for run-time subplan pruning, one per partitioned_rels
      */
     List       *part_prune_infos;
+    int            nasyncplans;    /* # of async plans, always at start of list */
+    int            referent;         /* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f59239b..6f4583b46c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -832,7 +832,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_ORIGIN_DROP,
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
+    WAIT_EVENT_SYNC_REP,
+    WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.16.3

From 072f6af8a2b394402e753a65569d64668e2cfe86 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c              |  26 ++
 contrib/postgres_fdw/expected/postgres_fdw.out | 100 ++--
 contrib/postgres_fdw/postgres_fdw.c            | 619 ++++++++++++++++++++++---
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 628 insertions(+), 139 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index fe4893a8e0..da7c826e4f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
     bool        invalidated;    /* true if reconnect is pending */
     uint32        server_hashvalue;    /* hash value of foreign server OID */
     uint32        mapping_hashvalue;    /* hash value of user mapping OID */
+    void        *storage;        /* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
         elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
              entry->conn, server->servername, user->umid, user->userid);
+        entry->storage = NULL;
     }
 
     /*
@@ -215,6 +217,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
     return entry->conn;
 }
 
+/*
+ * Returns the connection-specific storage for this user, allocating it
+ * (zeroed, initsize bytes) if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+    bool        found;
+    ConnCacheEntry *entry;
+    ConnCacheKey key;
+
+    key = user->umid;
+    entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+    Assert(found);
+
+    if (entry->storage == NULL)
+    {
+        entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+        memset(entry->storage, 0, initsize);
+    }
+
+    return entry->storage;
+}
+
 /*
  * Connect to remote server using specified server and user mapping properties.
  */
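
The new function is put to use in postgresBeginForeignScan() and friends
below; the pattern is one PgFdwConnpriv shared by all scans on a
connection:

    fsstate->s.conn = GetConnection(user, false);
    fsstate->s.connpriv = (PgFdwConnpriv *)
        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
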
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 248aa73c0b..bb6b1a8fdf 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6968,13 +6968,12 @@ select * from bar where f1 in (select f1 from foo) for update;
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           Async subplans: 1 
-                           ->  Async Foreign Scan on public.foo2
-                                 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
-                                 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
                            ->  Seq Scan on public.foo
                                  Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-(29 rows)
+                           ->  Foreign Scan on public.foo2
+                                 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
+                                 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+(23 rows)
 
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
@@ -7007,13 +7006,12 @@ select * from bar where f1 in (select f1 from foo) for share;
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           Async subplans: 1 
-                           ->  Async Foreign Scan on public.foo2
-                                 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
-                                 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
                            ->  Seq Scan on public.foo
                                  Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-(29 rows)
+                           ->  Foreign Scan on public.foo2
+                                 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
+                                 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+(23 rows)
 
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
@@ -7045,8 +7043,9 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           Async subplans: 1 
-                           ->  Async Foreign Scan on public.foo2
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                           ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
    ->  Hash Join
@@ -7062,13 +7061,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           Async subplans: 1 
-                           ->  Async Foreign Scan on public.foo2
-                                 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
-                                 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
                            ->  Seq Scan on public.foo
                                  Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-(41 rows)
+                           ->  Foreign Scan on public.foo2
+                                 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
+                                 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+(39 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
 select tableoid::regclass, * from bar order by 1,2;
@@ -7098,11 +7096,14 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               Async subplans: 2 
-               ->  Async Foreign Scan on public.foo2
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Async Foreign Scan on public.foo2 foo2_1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
+               ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
          ->  Hash
@@ -7122,18 +7123,17 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     Async subplans: 2 
-                     ->  Async Foreign Scan on public.foo2
-                           Output: ROW(foo2.f1), foo2.f1
-                           Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Async Foreign Scan on public.foo2 foo2_1
-                           Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
-                           Remote SQL: SELECT f1 FROM public.loct1
                      ->  Seq Scan on public.foo
                            Output: ROW(foo.f1), foo.f1
+                     ->  Foreign Scan on public.foo2
+                           Output: ROW(foo2.f1), foo2.f1
+                           Remote SQL: SELECT f1 FROM public.loct1
                      ->  Seq Scan on public.foo foo_1
                            Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-(47 rows)
+                     ->  Foreign Scan on public.foo2 foo2_1
+                           Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
+                           Remote SQL: SELECT f1 FROM public.loct1
+(45 rows)
 
 update bar set f2 = f2 + 100
 from
@@ -8155,12 +8155,11 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
  Sort
    Sort Key: t1.a, t3.c
    ->  Append
-         Async subplans: 2 
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: ((public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)) INNER JOIN (public.ftprt1_p1 t3)
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: ((public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)) INNER JOIN (public.ftprt1_p2 t3)
-(8 rows)
+(7 rows)
 
 SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE
t1.a% 25 =0 ORDER BY 1,2,3;
 
   a  |  b  |  c   
@@ -8179,10 +8178,9 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
  Sort
    Sort Key: t1.a, ftprt2_p1.b, ftprt2_p1.c
    ->  Append
-         Async subplans: 1 
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: (public.ftprt1_p1 t1) LEFT JOIN (public.ftprt2_p1 fprt2)
-(6 rows)
+(5 rows)
 
 SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) t2 ON (t1.a = t2.b and t1.b = t2.a)
WHEREt1.a < 10 ORDER BY 1,2,3;
 
  a | b |  c   
@@ -8202,12 +8200,11 @@ SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE
  Sort
    Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
    ->  Append
-         Async subplans: 2 
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)
-(8 rows)
+(7 rows)
 
 SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a % 25 =0 ORDER BY 1,2;
        t1       |       t2       
@@ -8226,12 +8223,11 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
  Sort
    Sort Key: t1.a, t1.b
    ->  Append
-         Async subplans: 2 
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)
-(8 rows)
+(7 rows)
 
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE
t1.a%25= 0 ORDER BY 1,2;
 
   a  |  b  
@@ -8313,11 +8309,10 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
          Group Key: fpagg_tab_p1.a
          Filter: (avg(fpagg_tab_p1.b) < '22'::numeric)
          ->  Append
-               Async subplans: 3 
-               ->  Async Foreign Scan on fpagg_tab_p1
-               ->  Async Foreign Scan on fpagg_tab_p2
-               ->  Async Foreign Scan on fpagg_tab_p3
-(10 rows)
+               ->  Foreign Scan on fpagg_tab_p1
+               ->  Foreign Scan on fpagg_tab_p2
+               ->  Foreign Scan on fpagg_tab_p3
+(9 rows)
 
 -- Plan with partitionwise aggregates is enabled
 SET enable_partitionwise_aggregate TO true;
@@ -8328,14 +8323,13 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
  Sort
    Sort Key: fpagg_tab_p1.a
    ->  Append
-         Async subplans: 3 
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p1 pagg_tab)
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p2 pagg_tab)
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p3 pagg_tab)
-(10 rows)
+(9 rows)
 
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | sum  | min | count 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 78b0f43ca8..8efbbf95a8 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -119,11 +125,28 @@ enum FdwDirectModifyPrivateIndex
     FdwDirectModifyPrivateSetProcessed
 };
 
+/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+    ForeignScanState   *leader;        /* leader node of this connection */
+    bool                busy;        /* true if this connection is busy */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+    PGconn       *conn;            /* connection for the scan */
+    PgFdwConnpriv *connpriv;    /* connection private memory */
+} PgFdwState;
+
 /*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table. NULL
                                  * for a foreign join scan. */
     TupleDesc    tupdesc;        /* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
     List       *retrieved_attrs;    /* list of retrieved attribute numbers */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
+    bool        result_ready;
     unsigned int cursor_number; /* quasi-unique ID for my cursor */
     bool        cursor_exists;    /* have we created the cursor? */
     int            numParams;        /* number of parameters passed to query */
@@ -150,6 +173,12 @@ typedef struct PgFdwScanState
     /* batch-level state, for optimizing rewinds and avoiding useless fetch */
     int            fetch_ct_2;        /* Min(# of fetches done, 2) */
     bool        eof_reached;    /* true if last fetch reached EOF */
+    bool        run_async;        /* true if run asynchronously */
+    bool        inqueue;        /* true if this node is in waiter queue */
+    ForeignScanState *waiter;    /* Next node to run a query among nodes
+                                 * sharing the same connection */
+    ForeignScanState *last_waiter;    /* last waiting node in waiting queue.
+                                     * valid only on the leader node */
 
     /* working memory contexts */
     MemoryContext batch_cxt;    /* context holding current batch of tuples */
@@ -163,11 +192,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     char       *p_name;            /* name of prepared statement, if created */
 
     /* extracted fdw_private data */
@@ -190,6 +219,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
@@ -293,6 +323,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
                                 RangeTblEntry *target_rte,
                                 Relation target_relation);
@@ -358,6 +389,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
                              RelOptInfo *input_rel,
                              RelOptInfo *output_rel,
                              void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+                                              WaitEventSet *wes,
+                                              void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -378,7 +413,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
                           EquivalenceClass *ec, EquivalenceMember *em,
                           void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static PgFdwModifyState *create_foreign_modify(EState *estate,
                       RangeTblEntry *rte,
@@ -469,6 +506,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     routine->IterateForeignScan = postgresIterateForeignScan;
     routine->ReScanForeignScan = postgresReScanForeignScan;
     routine->EndForeignScan = postgresEndForeignScan;
+    routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
     /* Functions for updating foreign tables */
     routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -505,6 +543,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     /* Support functions for upper relation push-down */
     routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+    /* Support functions for async execution */
+    routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+    routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
     PG_RETURN_POINTER(routine);
 }
 
@@ -1355,12 +1397,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    fsstate->conn = GetConnection(user, false);
+    fsstate->s.conn = GetConnection(user, false);
+    fsstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+    fsstate->s.connpriv->leader = NULL;
+    fsstate->s.connpriv->busy = false;
+    fsstate->waiter = NULL;
+    fsstate->last_waiter = node;
 
     /* Assign a unique ID for my cursor */
-    fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+    fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
     fsstate->cursor_exists = false;
 
+    /* Initialize async execution status */
+    fsstate->run_async = false;
+    fsstate->inqueue = false;
+
     /* Get private info created by planner functions. */
     fsstate->query = strVal(list_nth(fsplan->fdw_private,
                                      FdwScanPrivateSelectSql));
@@ -1408,40 +1460,250 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
                              &fsstate->param_values);
 }
 
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Adds the node to the end of the waiter queue.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+    PgFdwScanState   *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.connpriv->leader;
+    PgFdwScanState   *leader_state;
+    PgFdwScanState   *last_waiter_state;
+
+    Assert(leader && leader != node);
+
+    /* do nothing if the node is already in the queue */
+    if (fsstate->inqueue)
+        return;
+
+    leader_state = GetPgFdwScanState(leader);
+    last_waiter_state = GetPgFdwScanState(leader_state->last_waiter);
+    last_waiter_state->waiter = node;
+    leader_state->last_waiter = node;
+    fsstate->inqueue = true;
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Makes the first waiter the next leader.
+ * Returns the new leader, or NULL if there is no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *ret = fsstate->waiter;
+
+    Assert(fsstate->s.connpriv->leader == node);
+
+    if (ret)
+    {
+        PgFdwScanState *retstate = GetPgFdwScanState(ret);
+        fsstate->waiter = NULL;
+        retstate->last_waiter = fsstate->last_waiter;
+        retstate->inqueue = false;
+    }
+
+    fsstate->s.connpriv->leader = ret;
+
+    return ret;
+}
+
+/*
+ * Remove the node from the waiter queue.
+ *
+ * Unlike the two functions above, this can operate on the connection
+ * leader; in that case any pending result is absorbed first so that the
+ * connection becomes available again.
+ *
+ * Returns true if the node was found.
+ */
+static inline bool
+remove_async_node(ForeignScanState *node)
+{
+    PgFdwScanState        *fsstate = GetPgFdwScanState(node);
+    ForeignScanState    *leader = fsstate->s.connpriv->leader;
+    PgFdwScanState        *leader_state;
+    ForeignScanState    *prev;
+    PgFdwScanState        *prev_state;
+    ForeignScanState    *cur;
+
+    /* no need to remove me */
+    if (!leader || !fsstate->inqueue)
+        return false;
+
+    leader_state = GetPgFdwScanState(leader);
+
+    /* Remove the leader node */
+    if (leader == node)
+    {
+        ForeignScanState    *next_leader;
+
+        if (leader_state->s.connpriv->busy)
+        {
+            /*
+             * This node is waiting for a result; absorb it first so that
+             * subsequent commands can be sent on the connection.
+             */
+            PGconn       *conn = leader_state->s.conn;
+            PGresult   *res;
+
+            /* drain every pending result so the connection becomes idle */
+            while ((res = PQgetResult(conn)) != NULL)
+                PQclear(res);
+
+            leader_state->s.connpriv->busy = false;
+        }
+
+        /* Make the first waiter the leader */
+        if (leader_state->waiter)
+        {
+            PgFdwScanState *next_leader_state;
+
+            next_leader = leader_state->waiter;
+            next_leader_state = GetPgFdwScanState(next_leader);
+
+            leader_state->s.connpriv->leader = next_leader;
+            next_leader_state->last_waiter = leader_state->last_waiter;
+        }
+        leader_state->waiter = NULL;
+
+        return true;
+    }
+
+    /*
+     * Just remove the node in queue
+     *
+     * This function is called on the shutdown path. We don't bother
+     * considering faster way to do this.
+     */
+    prev = leader;
+    prev_state = leader_state;
+    cur = GetPgFdwScanState(prev)->waiter;
+    while (cur)
+    {
+        PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+        if (cur == node)
+        {
+            prev_state->waiter = curstate->waiter;
+            if (leader_state->last_waiter == cur)
+                leader_state->last_waiter = prev;
+
+            fsstate->inqueue = false;
+
+            return true;
+        }
+        prev = cur;
+        prev_state = curstate;
+        cur = curstate->waiter;
+    }
+
+    return false;
+}
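
To summarize the invariants these queue functions maintain: per connection,
at most one scan is the leader (connpriv->leader) and may have a query in
flight (connpriv->busy); the other scans chain through their 'waiter'
fields, and the leader's 'last_waiter' caches the tail so appends are O(1):

    connpriv->leader --> A --waiter--> B --waiter--> C
                         A's last_waiter == C
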
+
 /*
  * postgresIterateForeignScan
- *        Retrieve next row from the result set, or clear tuple slot to indicate
- *        EOF.
+ *        Retrieve next row from the result set.
+ *
+ *        For synchronous nodes, a cleared tuple slot indicates EOF.
+ *
+ *        For an asynchronous node, a cleared tuple slot has two possible
+ *        meanings: when the caller receives one, asyncstate indicates
+ *        whether the node has reached EOF (AS_AVAILABLE) or is still
+ *        waiting for data to come (AS_WAITING).
  */
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
-    /*
-     * If this is the first call after Begin or ReScan, we need to create the
-     * cursor on the remote side.
-     */
-    if (!fsstate->cursor_exists)
-        create_cursor(node);
+    if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+    {
+        /* we've run out, get some more tuples */
+        if (!node->fs_async)
+        {
+            /* finish any running query so we can send our own command */
+            vacate_connection((PgFdwState *) fsstate, false);
+
+            request_more_data(node);
+
+            /*
+             * Fetch the result immediately. This executes the next waiter if
+             * any.
+             */
+            fetch_received_data(node);
+        }
+        else if (!fsstate->s.connpriv->busy)
+        {
+            /* If the connection is not busy, just send the request. */
+            request_more_data(node);
+        }
+        else if (fsstate->s.connpriv->leader == node)
+        {
+            bool available = true;
+
+            /* Check if the result is available */
+            if (PQisBusy(fsstate->s.conn))
+            {
+                int rc = WaitLatchOrSocket(NULL,
+                                           WL_SOCKET_READABLE | WL_TIMEOUT,
+                                           PQsocket(fsstate->s.conn), 0,
+                                           WAIT_EVENT_ASYNC_WAIT);
+                if (!(rc & WL_SOCKET_READABLE))
+                    available = false;
+            }
+
+            /* The next waiter is started automatically */
+            if (available)
+                fetch_received_data(node);
+        }
+        else if (fsstate->s.connpriv->leader)
+        {
+            /*
+             * Someone else is using this connection, so add this node to
+             * the waiting queue.
+             */
+            add_async_waiter(node);
+        }
+    }
 
     /*
-     * Get some more tuples, if we've run out.
+     * If we haven't yet received a result for this node, return without a
+     * tuple to give another node a chance to run.
      */
     if (fsstate->next_tuple >= fsstate->num_tuples)
     {
-        /* No point in another fetch if we already detected EOF, though. */
-        if (!fsstate->eof_reached)
-            fetch_more_data(node);
-        /* If we didn't get any tuples, must be end of data. */
-        if (fsstate->next_tuple >= fsstate->num_tuples)
-            return ExecClearTuple(slot);
+        if (fsstate->eof_reached)
+        {
+            fsstate->result_ready = true;
+            node->ss.ps.asyncstate = AS_AVAILABLE;
+        }
+        else
+        {
+            fsstate->result_ready = false;
+            node->ss.ps.asyncstate = AS_WAITING;
+        }
+
+        return ExecClearTuple(slot);
     }
 
     /*
      * Return the next tuple.
      */
+    fsstate->result_ready = true;
+    node->ss.ps.asyncstate = AS_AVAILABLE;
     ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
                    slot,
                    InvalidBuffer,
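
A note on the leader branch above: WaitLatchOrSocket() is called with
WL_TIMEOUT and a timeout of zero, so it returns immediately; combined with
the WL_SOCKET_READABLE test this amounts to a nonblocking poll of the
connection:

    /* returns at once; rc has WL_SOCKET_READABLE only if data is waiting */
    rc = WaitLatchOrSocket(NULL, WL_SOCKET_READABLE | WL_TIMEOUT,
                           PQsocket(fsstate->s.conn), 0,
                           WAIT_EVENT_ASYNC_WAIT);
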
@@ -1457,7 +1719,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     char        sql[64];
     PGresult   *res;
 
@@ -1465,6 +1727,8 @@ postgresReScanForeignScan(ForeignScanState *node)
     if (!fsstate->cursor_exists)
         return;
 
+    vacate_connection((PgFdwState *)fsstate, true);
+
     /*
      * If any internal parameters affecting this node have changed, we'd
      * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1493,9 +1757,9 @@ postgresReScanForeignScan(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_exec_query(fsstate->conn, sql);
+    res = pgfdw_exec_query(fsstate->s.conn, sql);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+        pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
     PQclear(res);
 
     /* Now force a fresh FETCH. */
@@ -1513,7 +1777,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
     /* if fsstate is NULL, we are in EXPLAIN; nothing to do */
     if (fsstate == NULL)
@@ -1521,15 +1785,31 @@ postgresEndForeignScan(ForeignScanState *node)
 
     /* Close the cursor if open, to prevent accumulation of cursors */
     if (fsstate->cursor_exists)
-        close_cursor(fsstate->conn, fsstate->cursor_number);
+        close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
     /* Release remote connection */
-    ReleaseConnection(fsstate->conn);
-    fsstate->conn = NULL;
+    ReleaseConnection(fsstate->s.conn);
+    fsstate->s.conn = NULL;
 
     /* MemoryContexts will be deleted automatically. */
 }
 
+/*
+ * postgresShutdownForeignScan
+ *        Remove asynchrony state and clean up leftovers on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+    ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+    if (plan->operation != CMD_SELECT)
+        return;
+
+    /* remove the node from the waiting queue */
+    remove_async_node(node);
+}
+
 /*
  * postgresAddForeignUpdateTargets
  *        Add resjunk column(s) needed for update/delete on a foreign table
@@ -1753,6 +2033,9 @@ postgresExecForeignInsert(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    /* finish running query to send my command */
+    vacate_connection((PgFdwState *)fmstate, true);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1763,14 +2046,14 @@ postgresExecForeignInsert(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1778,10 +2061,10 @@ postgresExecForeignInsert(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -1819,6 +2102,9 @@ postgresExecForeignUpdate(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    /* finish running query to send my command */
+    vacate_connection((PgFdwState *)fmstate, true);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1839,14 +2125,14 @@ postgresExecForeignUpdate(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1854,10 +2140,10 @@ postgresExecForeignUpdate(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -1895,6 +2181,9 @@ postgresExecForeignDelete(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    /* finish running query to send my command */
+    vacate_connection((PgFdwState *)fmstate, true);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1915,14 +2204,14 @@ postgresExecForeignDelete(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1930,10 +2219,10 @@ postgresExecForeignDelete(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -2400,7 +2689,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    dmstate->conn = GetConnection(user, false);
+    dmstate->s.conn = GetConnection(user, false);
+    dmstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
     /* Update the foreign-join-related fields. */
     if (fsplan->scan.scanrelid == 0)
@@ -2485,7 +2776,11 @@ postgresIterateDirectModify(ForeignScanState *node)
      * If this is the first call after Begin, execute the statement.
      */
     if (dmstate->num_tuples == -1)
+    {
+        /* finish running query to send my command */
+        vacate_connection((PgFdwState *)dmstate, true);
         execute_dml_stmt(node);
+    }
 
     /*
      * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2532,8 +2827,8 @@ postgresEndDirectModify(ForeignScanState *node)
         PQclear(dmstate->result);
 
     /* Release remote connection */
-    ReleaseConnection(dmstate->conn);
-    dmstate->conn = NULL;
+    ReleaseConnection(dmstate->s.conn);
+    dmstate->s.conn = NULL;
 
     /* close the target relation. */
     if (dmstate->resultRel)
@@ -2656,6 +2951,7 @@ estimate_path_cost_size(PlannerInfo *root,
         List       *local_param_join_conds;
         StringInfoData sql;
         PGconn       *conn;
+        PgFdwConnpriv *connpriv;
         Selectivity local_sel;
         QualCost    local_cost;
         List       *fdw_scan_tlist = NIL;
@@ -2698,6 +2994,18 @@ estimate_path_cost_size(PlannerInfo *root,
 
         /* Get the remote estimate */
         conn = GetConnection(fpinfo->user, false);
+        connpriv = GetConnectionSpecificStorage(fpinfo->user,
+                                                sizeof(PgFdwConnpriv));
+        if (connpriv)
+        {
+            PgFdwState tmpstate;
+            tmpstate.conn = conn;
+            tmpstate.connpriv = connpriv;
+
+            /* finish running query to send my command */
+            vacate_connection(&tmpstate, true);
+        }
+
         get_remote_estimate(sql.data, conn, &rows, &width,
                             &startup_cost, &total_cost);
         ReleaseConnection(conn);
@@ -3061,11 +3369,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     ExprContext *econtext = node->ss.ps.ps_ExprContext;
     int            numParams = fsstate->numParams;
     const char **values = fsstate->param_values;
-    PGconn       *conn = fsstate->conn;
+    PGconn       *conn = fsstate->s.conn;
     StringInfoData buf;
     PGresult   *res;
 
@@ -3128,50 +3436,121 @@ create_cursor(ForeignScanState *node)
 }
 
 /*
- * Fetch some more rows from the node's cursor.
+ * Send the next fetch request for the node.  If the given node is not the
+ * current connection leader, push the current leader back onto the waiter
+ * queue and make the given node the leader.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.connpriv->leader;
+    PGconn       *conn = fsstate->s.conn;
+    char        sql[64];
+
+    /* must be non-busy */
+    Assert(!fsstate->s.connpriv->busy);
+    /* must be not-eof */
+    Assert(!fsstate->eof_reached);
+
+    /*
+     * If this is the first call after Begin or ReScan, we need to create the
+     * cursor on the remote side.
+     */
+    if (!fsstate->cursor_exists)
+        create_cursor(node);
+
+    snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+             fsstate->fetch_size, fsstate->cursor_number);
+
+    if (!PQsendQuery(conn, sql))
+        pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+    fsstate->s.connpriv->busy = true;
+
+    /* Let the node be the leader if it is different from current one */
+    if (leader != node)
+    {
+        /*
+         * If the connection leader exists, insert the node as the connection
+         * leader making the current leader be the first waiter.
+         */
+        if (leader != NULL)
+        {
+            remove_async_node(node);
+            fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+            fsstate->waiter = leader;
+        }
+        fsstate->s.connpriv->leader = node;
+    }
+}
+
+/*
+ * Fetches the received data and automatically sends the next waiter's request.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     PGresult   *volatile res = NULL;
     MemoryContext oldcontext;
+    ForeignScanState *waiter;
+
+    /* I should be the current connection leader */
+    Assert(fsstate->s.connpriv->leader == node);
 
     /*
      * We'll store the tuples in the batch_cxt.  First, flush the previous
-     * batch.
+     * batch if no tuples remain.
      */
-    fsstate->tuples = NULL;
-    MemoryContextReset(fsstate->batch_cxt);
+    if (fsstate->next_tuple >= fsstate->num_tuples)
+    {
+        fsstate->tuples = NULL;
+        fsstate->num_tuples = 0;
+        MemoryContextReset(fsstate->batch_cxt);
+    }
+    else if (fsstate->next_tuple > 0)
+    {
+        /* move the remaining tuples to the beginning of the store */
+        int n = 0;
+
+        while (fsstate->next_tuple < fsstate->num_tuples)
+            fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+        fsstate->num_tuples = n;
+    }
+
     oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
     /* PGresult must be released before leaving this function. */
     PG_TRY();
     {
-        PGconn       *conn = fsstate->conn;
+        PGconn       *conn = fsstate->s.conn;
         char        sql[64];
-        int            numrows;
+        int            addrows;
+        size_t        newsize;
         int            i;
 
         snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
                  fsstate->fetch_size, fsstate->cursor_number);
 
-        res = pgfdw_exec_query(conn, sql);
+        res = pgfdw_get_result(conn, sql);
         /* On error, report the original query, not the FETCH. */
         if (PQresultStatus(res) != PGRES_TUPLES_OK)
             pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
         /* Convert the data into HeapTuples */
-        numrows = PQntuples(res);
-        fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-        fsstate->num_tuples = numrows;
-        fsstate->next_tuple = 0;
+        addrows = PQntuples(res);
+        newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+        if (fsstate->tuples)
+            fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+        else
+            fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-        for (i = 0; i < numrows; i++)
+        for (i = 0; i < addrows; i++)
         {
             Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-            fsstate->tuples[i] =
+            fsstate->tuples[fsstate->num_tuples + i] =
                 make_tuple_from_result_row(res, i,
                                            fsstate->rel,
                                            fsstate->attinmeta,
@@ -3181,26 +3560,76 @@ fetch_more_data(ForeignScanState *node)
         }
 
         /* Update fetch_ct_2 */
-        if (fsstate->fetch_ct_2 < 2)
+        if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
             fsstate->fetch_ct_2++;
 
+        fsstate->next_tuple = 0;
+        fsstate->num_tuples += addrows;
+
         /* Must be EOF if we didn't get as many tuples as we asked for. */
-        fsstate->eof_reached = (numrows < fsstate->fetch_size);
+        fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
         PQclear(res);
         res = NULL;
     }
     PG_CATCH();
     {
+        fsstate->s.connpriv->busy = false;
+
         if (res)
             PQclear(res);
         PG_RE_THROW();
     }
     PG_END_TRY();
 
+    fsstate->s.connpriv->busy = false;
+
+    /* let the first waiter be the next leader of this connection */
+    waiter = move_to_next_waiter(node);
+
+    /* send the next request if any */
+    if (waiter)
+        request_more_data(waiter);
+
     MemoryContextSwitchTo(oldcontext);
 }
 
+/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+    PgFdwConnpriv *connpriv = fdwstate->connpriv;
+    ForeignScanState *leader;
+
+    /* the connection is already available */
+    if (connpriv == NULL || connpriv->leader == NULL || !connpriv->busy)
+        return;
+
+    /*
+     * Let the current connection leader read the result of the running
+     * query.
+     */
+    leader = connpriv->leader;
+    fetch_received_data(leader);
+
+    /* let the first waiter be the next leader of this connection */
+    move_to_next_waiter(leader);
+
+    if (!clear_queue)
+        return;
+
+    /* Clear the waiting list */
+    while (leader)
+    {
+        PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+        fsstate->last_waiter = NULL;
+        leader = fsstate->waiter;
+        fsstate->waiter = NULL;
+    }
+}
+
 /*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
@@ -3314,7 +3743,9 @@ create_foreign_modify(EState *estate,
     user = GetUserMapping(userid, table->serverid);
 
     /* Open connection; report that we'll create a prepared statement. */
-    fmstate->conn = GetConnection(user, true);
+    fmstate->s.conn = GetConnection(user, true);
+    fmstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
     fmstate->p_name = NULL;        /* prepared statement not made yet */
 
     /* Set up remote query information. */
@@ -3387,7 +3818,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
     /* Construct name we'll use for the prepared statement. */
     snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-             GetPrepStmtNumber(fmstate->conn));
+             GetPrepStmtNumber(fmstate->s.conn));
     p_name = pstrdup(prep_name);
 
     /*
@@ -3397,12 +3828,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * the prepared statements we use in this module are simple enough that
      * the remote server will make the right choices.
      */
-    if (!PQsendPrepare(fmstate->conn,
+    if (!PQsendPrepare(fmstate->s.conn,
                        p_name,
                        fmstate->query,
                        0,
                        NULL))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3410,9 +3841,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
     PQclear(res);
 
     /* This action shows that the prepare has been done. */
@@ -3537,16 +3968,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
          * We don't use a PG_TRY block here, so be careful not to throw error
          * without releasing the PGresult.
          */
-        res = pgfdw_exec_query(fmstate->conn, sql);
+        res = pgfdw_exec_query(fmstate->s.conn, sql);
         if (PQresultStatus(res) != PGRES_COMMAND_OK)
-            pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+            pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
         PQclear(res);
         fmstate->p_name = NULL;
     }
 
     /* Release remote connection */
-    ReleaseConnection(fmstate->conn);
-    fmstate->conn = NULL;
+    ReleaseConnection(fmstate->s.conn);
+    fmstate->s.conn = NULL;
 }
 
 /*
@@ -3706,9 +4137,9 @@ execute_dml_stmt(ForeignScanState *node)
      * the desired result.  This allows us to avoid assuming that the remote
      * server has the same OIDs we do for the parameters' types.
      */
-    if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+    if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
                            NULL, values, NULL, NULL, 0))
-        pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+        pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3716,10 +4147,10 @@ execute_dml_stmt(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+    dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
     if (PQresultStatus(dmstate->result) !=
         (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+        pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
                            dmstate->query);
 
     /* Get the number of rows affected. */
@@ -5203,6 +5634,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
     /* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+    return true;
+}
+
+
+/*
+ * Configure a wait event.
+ *
+ * Add a wait event only when the node is the connection leader. Otherwise
+ * another node on this connection is the leader.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                                  void *caller_data, bool reinit)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+    /* If the caller didn't reinit, this event is already in the event set */
+    if (!reinit)
+        return true;
+
+    if (fsstate->s.connpriv->leader == node)
+    {
+        AddWaitEventToSet(wes,
+                          WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+                          NULL, caller_data);
+        return true;
+    }
+
+    return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index a5d4011e8d..f344fb7f66 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
     UserMapping *user;            /* only set in use_remote_estimate mode */
 
     int            fetch_size;        /* fetch size for this remote table */
+    bool        allow_prefetch;    /* true to allow overlapped fetching  */
 
     /*
      * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 231b1e01a5..8ecc903c20 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1617,25 +1617,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1677,12 +1677,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1741,8 +1741,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.16.3


Re: [HACKERS] asynchronous execution

From
Kyotaro HORIGUCHI
Date:
This version has received further refactoring.

At Fri, 11 May 2018 17:45:20 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180511.174520.188681124.horiguchi.kyotaro@lab.ntt.co.jp>
> But, this is not just a rebased version. While fixing serious
> conflicts, I refactored the patch, and I believe it is now far more
> readable than the previous version.
> 
> - Waiting queue manipulation is moved into new functions. It had
>   a bug where the same node could be inserted into the queue more
>   than once; that is now fixed.
> 
> - postgresIterateForeignScan had a somewhat tricky structure that
>   merged similar procedures, which made it hard to read. Now it is
>   far simpler and more straightforward.
> 
> - Still this works only on Append/ForeignScan.
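
To make the waiting-queue discipline mentioned above concrete, here is
a minimal sketch of the protocol (hypothetical, simplified types; not
the patch code): one node per connection is the leader and owns the
in-flight FETCH, every other scan on the same connection waits in a
singly linked list, and when the leader consumes its result it hands
leadership to the first waiter.

typedef struct ScanNode ScanNode;
struct ScanNode
{
    ScanNode   *waiter;         /* next node waiting on this connection */
    ScanNode   *last_waiter;    /* queue tail; only valid on the leader */
};

typedef struct Connection
{
    ScanNode   *leader;         /* node that owns the in-flight request */
    bool        busy;           /* true while a FETCH is outstanding */
} Connection;

/*
 * After the leader has consumed its result, pass leadership to the
 * first waiter and return it (NULL if the queue is empty).
 */
static ScanNode *
pass_leadership(Connection *conn)
{
    ScanNode   *leader = conn->leader;
    ScanNode   *next = leader->waiter;

    leader->waiter = NULL;
    if (next != NULL)
        next->last_waiter = leader->last_waiter;
    conn->leader = next;
    conn->busy = false;
    return next;
}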

I performed almost the same tests (again) as before, but with some
additions:

- partitioned tables (there should be no difference from
  inheritance, and the results indeed look the same)

- added tests for fetch_size of 200 and 1000 in addition to 100.

  A fetch size of 100 seems to magnify the lag from context
  switching unreasonably for tests D/F below when run on a single
  underpowered box. They became about twice as fast after *adding* a
  small delay (1000 calls to clock_gettime() (*1)) just before
  epoll_wait. Things might be different on separate machines, but I
  am not sure. I have not found the exact cause, nor a way to avoid
  it.

*1: I chose that function because I noticed at first that the
    queries got much faster when merely prefixed with "explain
    analyze".
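
For illustration, the delay in question is essentially the following
(a hypothetical sketch of the experiment, not part of the patch;
small_delay is a made-up name):

#include <time.h>

static void
small_delay(void)
{
    struct timespec ts;
    int         i;

    /* burn roughly the cost of 1000 clock_gettime() calls */
    for (i = 0; i < 1000; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);
}

/* ... small_delay() would be called just before epoll_wait() ... */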

Async Append (theoretically) no longer affects the non-async path at
all, so B is expected to show no degradation. The difference seems to
be within measurement error.

D and F show the gain when all foreign tables share one connection,
and E and G show the gain when every foreign table has a dedicated
connection.

(previous numbers)
>                            patched(ms)  unpatched(ms)   gain(%)
> A: simple table scan     :  3562.32      3444.81        -3.4
> B: local partitioning    :  1451.25      1604.38         9.5
> C: single remote table   :  8818.92      9297.76         5.1
> D: sharding (single con) :  5966.14      6646.73        10.2
> E: sharding (multi con)  :  1802.25      6515.49        72.3

fetch_size = 100
                            patched(ms)  unpatched(ms)   gain(%)
A: simple table scan     :   3065.82       3046.82        -0.62
B: local partitioning    :   1393.98       1378.00        -1.16
C: single remote table   :   8499.73       8595.66         1.12
D: sharding (single con) :   9267.85       9251.59        -0.18
E: sharding (multi con)  :   2567.02       9295.22        72.38
F: partition (single con):   9241.08       9060.19        -2.00
G: partition (multi con) :   2548.86       9419.18        72.94

fetch_size = 200
                            patched(ms)  unpatched(ms)   gain(%)
A: simple table scan     :   3067.08       2999.23        -2.3 
B: local partitioning    :   1392.07       1384.49        -0.5 
C: single remote table   :   8521.72       8505.48        -0.2 
D: sharding (single con) :   6752.81       7076.02         4.6  
E: sharding (multi con)  :   1958.2        7188.02        72.8 
F: partition (single con):   6756.72       7000.72         3.5  
G: partition (multi con) :   1969.8        7228.85        72.8 

fetch_size = 1000
                            patched(ms)  unpatched(ms)   gain(%)
A: simple table scan     :   4547.44       4519.34        -0.62
B: local partitioning    :   2880.66       2739.43        -5.16
C: single remote table   :   8448.04       8572.15         1.45
D: sharding (single con) :   2405.01       5919.31        59.37
E: sharding (multi con)  :   1872.15       5963.04        68.60
F: partition (single con):   2369.08       5960.81        60.26
G: partition (multi con) :   1854.69       5893.65        68.53


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 54f85c159f3feee5ee2dac6daacc7330ec101ed5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

In certain cases a WaitEventSet needs to be released via a resource
owner. This change adds resource-owner tracking of WaitEventSets and
allows the creator of a WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 67 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index a4f6d4deeb..890972b9b8 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
                 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
     AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
                       NULL, NULL);
     AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index e6706f7fb8..5457899f2d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
     int            nevents;        /* number of registered events */
     int            nevents_space;    /* maximum number of events in this set */
 
+    ResourceOwner    resowner;    /* Resource owner */
+
     /*
      * Array, of nevents_space length, storing the definition of events this
      * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
     int            ret = 0;
     int            rc;
     WaitEvent    event;
-    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
     if (wakeEvents & WL_TIMEOUT)
         Assert(timeout >= 0);
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
     WaitEventSet *set;
     char       *data;
     Size        sz = 0;
 
+    if (res)
+        ResourceOwnerEnlargeWESs(res);
+
     /*
      * Use MAXALIGN size/alignment to guarantee that later uses of memory are
      * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
     StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+    /* Register this wait event set if requested */
+    set->resowner = res;
+    if (res)
+        ResourceOwnerRememberWES(set->resowner, set);
+
     return set;
 }
 
@@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set)
     }
 #endif
 
+    if (set->resowner != NULL)
+        ResourceOwnerForgetWES(set->resowner, set);
+
     pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index ef1d5baf01..30edc8e83a 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -69,7 +69,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     {
         WaitEventSet *new_event_set;
 
-        new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+        new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
         AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
                           MyLatch, NULL);
         AddWaitEventToSet(new_event_set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index bce021e100..802b79a660 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -126,6 +126,7 @@ typedef struct ResourceOwnerData
     ResourceArray filearr;        /* open temporary files */
     ResourceArray dsmarr;        /* dynamic shmem segments */
     ResourceArray jitarr;        /* JIT contexts */
+    ResourceArray wesarr;        /* wait event sets */
 
     /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
     int            nlocks;            /* number of owned locks */
@@ -171,6 +172,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -440,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
     ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
     ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
     ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+    ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
     return owner;
 }
@@ -549,6 +552,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 
             jit_release_context(context);
         }
+
+        /* Ditto for wait event sets */
+        while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+        {
+            WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+            if (isCommit)
+                PrintWESLeakWarning(event);
+            FreeWaitEventSet(event);
+        }
     }
     else if (phase == RESOURCE_RELEASE_LOCKS)
     {
@@ -697,6 +710,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     Assert(owner->filearr.nitems == 0);
     Assert(owner->dsmarr.nitems == 0);
     Assert(owner->jitarr.nitems == 0);
+    Assert(owner->wesarr.nitems == 0);
     Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
     /*
@@ -724,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     ResourceArrayFree(&(owner->filearr));
     ResourceArrayFree(&(owner->dsmarr));
     ResourceArrayFree(&(owner->jitarr));
+    ResourceArrayFree(&(owner->wesarr));
 
     pfree(owner);
 }
@@ -1301,3 +1316,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
         elog(ERROR, "JIT context %p is not owned by resource owner %s",
              DatumGetPointer(handle), owner->name);
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+    ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+    ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identifier of a wait event
+     * set, so use its pointer instead.
+     */
+    if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+        elog(ERROR, "wait event set %p is not owned by resource owner %s",
+             events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identifier of a wait event
+     * set, so use its pointer instead.
+     */
+    elog(WARNING, "wait event set leak: %p still referenced",
+         events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index a4bcb48874..838845af01 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+                                        ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
                   Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a6e8eb71ab..3c06e4c3f8 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
 extern void ResourceOwnerForgetJIT(ResourceOwner owner,
                        Datum handle);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+                         WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+                       WaitEventSet *);
+
 #endif                            /* RESOWNER_PRIVATE_H */
-- 
2.16.3
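
For reference, a caller of the patched interface would look roughly
like this (a hedged sketch against the new signature; wait_for_socket
is a hypothetical helper, and passing 0 as wait_event_info is a
placeholder):

static void
wait_for_socket(pgsocket sock, long timeout)
{
    WaitEventSet *wes;
    WaitEvent    event;

    /*
     * Registering the set with a resource owner means it is released
     * automatically if an error aborts the (sub)transaction.
     */
    wes = CreateWaitEventSet(TopTransactionContext,
                             TopTransactionResourceOwner, 2);
    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
    AddWaitEventToSet(wes, WL_SOCKET_READABLE, sock, NULL, NULL);
    (void) WaitEventSetWait(wes, timeout, &event, 1, 0);
    FreeWaitEventSet(wes);    /* normal path; the resowner covers error paths */
}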

From 19ff6af521070b8245f4bd04bd535a5286be1509 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 15 May 2018 20:21:32 +0900
Subject: [PATCH 2/3] infrastructure for asynchronous execution

This patch adds infrastructure for asynchronous execution. As a PoC,
it makes only Append capable of handling asynchronously executable
subnodes.
---
 src/backend/commands/explain.c          |  17 ++
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/execAsync.c        | 145 ++++++++++++++++
 src/backend/executor/nodeAppend.c       | 285 ++++++++++++++++++++++++++++----
 src/backend/executor/nodeForeignscan.c  |  22 ++-
 src/backend/nodes/bitmapset.c           |  72 ++++++++
 src/backend/nodes/copyfuncs.c           |   2 +
 src/backend/nodes/outfuncs.c            |   2 +
 src/backend/nodes/readfuncs.c           |   2 +
 src/backend/optimizer/plan/createplan.c |  68 +++++++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/utils/adt/ruleutils.c       |   8 +-
 src/include/executor/execAsync.h        |  23 +++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 ++
 src/include/nodes/bitmapset.h           |   1 +
 src/include/nodes/execnodes.h           |  18 +-
 src/include/nodes/plannodes.h           |   7 +
 src/include/pgstat.h                    |   3 +-
 20 files changed, 646 insertions(+), 49 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 73d94b7235..09c5327cb4 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -83,6 +83,7 @@ static void show_sort_keys(SortState *sortstate, List *ancestors,
                ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                        ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
               ExplainState *es);
 static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1168,6 +1169,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
         }
         if (plan->parallel_aware)
             appendStringInfoString(es->str, "Parallel ");
+        if (plan->async_capable)
+            appendStringInfoString(es->str, "Async ");
         appendStringInfoString(es->str, pname);
         es->indent++;
     }
@@ -1690,6 +1693,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
         case T_Hash:
             show_hash_info(castNode(HashState, planstate), es);
             break;
+
+        case T_Append:
+            show_append_info(castNode(AppendState, planstate), es);
+            break;
+
         default:
             break;
     }
@@ -2027,6 +2035,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                          ancestors, es);
 }
 
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+    Append *plan = (Append *) astate->ps.plan;
+
+    if (plan->nasyncplans > 0)
+        ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
 /*
  * Show the grouping keys for an Agg node.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..8ad2adfe1c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execPartition.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..db477e2cf6
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,145 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *      Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *      src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+void
+ExecAsyncSetState(PlanState *pstate, AsyncState status)
+{
+    pstate->asyncstate = status;
+}
+
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                       void *data, bool reinit)
+{
+    switch (nodeTag(node))
+    {
+        case T_ForeignScanState:
+            return ExecForeignAsyncConfigureWait((ForeignScanState *) node,
+                                                 wes, data, reinit);
+        default:
+            elog(ERROR, "unrecognized node type: %d",
+                 (int) nodeTag(node));
+    }
+
+    return false;                /* keep compiler quiet */
+}
+
+/*
+ * Struct for the memory context callback argument used in
+ * ExecAsyncEventWait.
+ */
+typedef struct
+{
+    int **p_refind;
+    int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * Callback function to reset the static variables in ExecAsyncEventWait
+ * that point into TopTransactionContext memory.
+ */
+static void
+ExecAsyncMemoryContextCallback(void *arg)
+{
+    /* arg is the address of the variable refind in ExecAsyncEventWait */
+    ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+
+    *mcbarg->p_refind = NULL;
+    *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+    static int *refind = NULL;
+    static int refindsize = 0;
+    WaitEventSet *wes;
+    WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+    int noccurred = 0;
+    Bitmapset *fired_events = NULL;
+    int i;
+    int n;
+
+    n = bms_num_members(waitnodes);
+    wes = CreateWaitEventSet(TopTransactionContext,
+                             TopTransactionResourceOwner, n);
+    if (refindsize < n)
+    {
+        if (refindsize == 0)
+            refindsize = EVENT_BUFFER_SIZE; /* XXX */
+        while (refindsize < n)
+            refindsize *= 2;
+        if (refind)
+            refind = (int *) repalloc(refind, refindsize * sizeof(int));
+        else
+        {
+            static ExecAsync_mcbarg mcb_arg =
+                { &refind, &refindsize };
+            static MemoryContextCallback mcb =
+                { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+            MemoryContext oldctxt =
+                MemoryContextSwitchTo(TopTransactionContext);
+
+            /*
+             * refind points to a memory block in
+             * TopTransactionContext. Register a callback to reset it.
+             */
+            MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+            refind = (int *) palloc(refindsize * sizeof(int));
+            MemoryContextSwitchTo(oldctxt);
+        }
+    }
+
+    n = 0;
+    for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+         i = bms_next_member(waitnodes, i))
+    {
+        refind[i] = i;
+        if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+            n++;
+    }
+
+    if (n == 0)
+    {
+        FreeWaitEventSet(wes);
+        return NULL;
+    }
+
+    noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+                                 EVENT_BUFFER_SIZE,
+                                 WAIT_EVENT_ASYNC_WAIT);
+    FreeWaitEventSet(wes);
+    if (noccurred == 0)
+        return NULL;
+
+    for (i = 0 ; i < noccurred ; i++)
+    {
+        WaitEvent *w = &occurred_event[i];
+
+        if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+        {
+            int n = *(int*)w->user_data;
+
+            fired_events = bms_add_member(fired_events, n);
+        }
+    }
+
+    return fired_events;
+}
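
The nodeAppend.c changes below drive this interface with essentially
the following loop (a simplified sketch, not patch code; node, pending,
and timeout come from the surrounding context, and store_result() is a
hypothetical stand-in for the patch's result buffering):

    while (!bms_is_empty(pending))
    {
        Bitmapset  *fired;
        int         i;

        /* wait for any pending async subplan to become ready */
        fired = ExecAsyncEventWait(node->appendplans, pending, timeout);

        while ((i = bms_first_member(fired)) >= 0)
        {
            PlanState  *subnode = node->appendplans[i];
            TupleTableSlot *slot = ExecProcNode(subnode);

            if (subnode->asyncstate == AS_AVAILABLE)
            {
                if (!TupIsNull(slot))
                    store_result(slot);    /* hypothetical result buffer */
                pending = bms_del_member(pending, i);
            }
        }
        bms_free(fired);
    }
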
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 6bc3e470bf..94fafe72fb 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
 #include "executor/execdebug.h"
 #include "executor/execPartition.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -81,6 +82,7 @@ struct ParallelAppendState
 #define NO_MATCHING_SUBPLANS        -2
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -104,13 +106,14 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     PlanState **appendplanstates;
     Bitmapset  *validsubplans;
     int            nplans;
+    int            nasyncplans;
     int            firstvalid;
     int            i,
                 j;
     ListCell   *lc;
 
     /* check for unsupported flags */
-    Assert(!(eflags & EXEC_FLAG_MARK));
+    Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
     /*
      * Lock the non-leaf tables in the partition tree controlled by this node.
@@ -123,10 +126,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
      */
     appendstate->ps.plan = (Plan *) node;
     appendstate->ps.state = estate;
-    appendstate->ps.ExecProcNode = ExecAppend;
+
+    /* choose appropriate version of Exec function */
+    if (node->nasyncplans == 0)
+        appendstate->ps.ExecProcNode = ExecAppend;
+    else
+        appendstate->ps.ExecProcNode = ExecAppendAsync;
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+    appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 
     /* If run-time partition pruning is enabled, then set that up now */
     if (node->part_prune_infos != NIL)
@@ -159,7 +167,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
              */
             if (bms_is_empty(validsubplans))
             {
-                appendstate->as_whichplan = NO_MATCHING_SUBPLANS;
+                appendstate->as_whichsyncplan = NO_MATCHING_SUBPLANS;
 
                 /* Mark the first as valid so that it's initialized below */
                 validsubplans = bms_make_singleton(0);
@@ -213,11 +221,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
      */
     j = i = 0;
     firstvalid = nplans;
+    nasyncplans = 0;
     foreach(lc, node->appendplans)
     {
         if (bms_is_member(i, validsubplans))
         {
             Plan       *initNode = (Plan *) lfirst(lc);
+            int            sub_eflags = eflags;
+
+            /* Let async-capable subplans run asynchronously */
+            if (i < node->nasyncplans)
+            {
+                sub_eflags |= EXEC_FLAG_ASYNC;
+                nasyncplans++;
+            }
 
             /*
              * Record the lowest appendplans index which is a valid partial
@@ -226,7 +243,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
             if (i >= node->first_partial_plan && j < firstvalid)
                 firstvalid = j;
 
-            appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+            appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
         }
         i++;
     }
@@ -235,6 +252,21 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     appendstate->appendplans = appendplanstates;
     appendstate->as_nplans = nplans;
 
+    /* fill in async stuff */
+    appendstate->as_nasyncplans = nasyncplans;
+    appendstate->as_syncdone = (nasyncplans == nplans);
+
+    if (appendstate->as_nasyncplans)
+    {
+        appendstate->as_asyncresult = (TupleTableSlot **)
+            palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+        /* initially, all async subplans need a request */
+        for (i = 0; i < appendstate->as_nasyncplans; ++i)
+            appendstate->as_needrequest =
+                bms_add_member(appendstate->as_needrequest, i);
+    }
+
     /*
      * Miscellaneous initialization
      */
@@ -258,21 +290,23 @@ ExecAppend(PlanState *pstate)
 {
     AppendState *node = castNode(AppendState, pstate);
 
-    if (node->as_whichplan < 0)
+    if (node->as_whichsyncplan < 0)
     {
         /*
          * If no subplan has been chosen, we must choose one before
          * proceeding.
          */
-        if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+        if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
             !node->choose_next_subplan(node))
             return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 
         /* Nothing to do if there are no matching subplans */
-        else if (node->as_whichplan == NO_MATCHING_SUBPLANS)
+        else if (node->as_whichsyncplan == NO_MATCHING_SUBPLANS)
             return ExecClearTuple(node->ps.ps_ResultTupleSlot);
     }
 
+    Assert(node->as_nasyncplans == 0);
+
     for (;;)
     {
         PlanState  *subnode;
@@ -283,8 +317,9 @@ ExecAppend(PlanState *pstate)
         /*
          * figure out which subplan we are currently processing
          */
-        Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-        subnode = node->appendplans[node->as_whichplan];
+        Assert(node->as_whichsyncplan >= 0 &&
+               node->as_whichsyncplan < node->as_nplans);
+        subnode = node->appendplans[node->as_whichsyncplan];
 
         /*
          * get a tuple from the subplan
@@ -307,6 +342,175 @@ ExecAppend(PlanState *pstate)
     }
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+    AppendState *node = castNode(AppendState, pstate);
+    Bitmapset *needrequest;
+    int    i;
+
+    Assert(node->as_nasyncplans > 0);
+
+restart:
+    if (node->as_nasyncresult > 0)
+    {
+        --node->as_nasyncresult;
+        return node->as_asyncresult[node->as_nasyncresult];
+    }
+
+    needrequest = node->as_needrequest;
+    node->as_needrequest = NULL;
+    while ((i = bms_first_member(needrequest)) >= 0)
+    {
+        TupleTableSlot *slot;
+        PlanState *subnode = node->appendplans[i];
+
+        slot = ExecProcNode(subnode);
+        if (subnode->asyncstate == AS_AVAILABLE)
+        {
+            if (!TupIsNull(slot))
+            {
+                node->as_asyncresult[node->as_nasyncresult++] = slot;
+                node->as_needrequest = bms_add_member(node->as_needrequest, i);
+            }
+        }
+        else
+            node->as_pending_async = bms_add_member(node->as_pending_async, i);
+    }
+    bms_free(needrequest);
+
+    for (;;)
+    {
+        TupleTableSlot *result;
+
+        /* return now if a result is available */
+        if (node->as_nasyncresult > 0)
+        {
+            --node->as_nasyncresult;
+            return node->as_asyncresult[node->as_nasyncresult];
+        }
+
+        while (!bms_is_empty(node->as_pending_async))
+        {
+            long timeout = node->as_syncdone ? -1 : 0;
+            Bitmapset *fired;
+            int i;
+
+            fired = ExecAsyncEventWait(node->appendplans,
+                                       node->as_pending_async,
+                                       timeout);
+
+            if (bms_is_empty(fired) && node->as_syncdone)
+            {
+                /*
+                 * No subplan fired. This can happen even in normal
+                 * operation, when a subnode had already prepared results
+                 * before we waited. as_pending_async then holds stale
+                 * information, so restart from the beginning.
+                 */
+                node->as_needrequest = node->as_pending_async;
+                node->as_pending_async = NULL;
+                goto restart;
+            }
+
+            while ((i = bms_first_member(fired)) >= 0)
+            {
+                TupleTableSlot *slot;
+                PlanState *subnode = node->appendplans[i];
+                slot = ExecProcNode(subnode);
+                if (subnode->asyncstate == AS_AVAILABLE)
+                {
+                    if (!TupIsNull(slot))
+                    {
+                        node->as_asyncresult[node->as_nasyncresult++] = slot;
+                        node->as_needrequest =
+                            bms_add_member(node->as_needrequest, i);
+                    }
+                    node->as_pending_async =
+                        bms_del_member(node->as_pending_async, i);
+                }
+            }
+            bms_free(fired);
+
+            /* return now if a result is available */
+            if (node->as_nasyncresult > 0)
+            {
+                --node->as_nasyncresult;
+                return node->as_asyncresult[node->as_nasyncresult];
+            }
+
+            if (!node->as_syncdone)
+                break;
+        }
+
+        /*
+         * If there is no asynchronous activity still pending and the
+         * synchronous activity is also complete, we're totally done scanning
+         * this node.  Otherwise, we're done with the asynchronous stuff but
+         * must continue scanning the synchronous children.
+         */
+        if (node->as_syncdone)
+        {
+            Assert(bms_is_empty(node->as_pending_async));
+            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+        }
+
+        /*
+         * get a tuple from the subplan
+         */
+
+        if (node->as_whichsyncplan < 0)
+        {
+            /*
+             * If no subplan has been chosen, we must choose one before
+             * proceeding.
+             */
+            if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
+                !node->choose_next_subplan(node))
+            {
+                node->as_syncdone = true;
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+            }
+
+            /* Nothing to do if there are no matching subplans */
+            else if (node->as_whichsyncplan == NO_MATCHING_SUBPLANS)
+            {
+                node->as_syncdone = true;
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+            }
+        }
+
+        result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+        if (!TupIsNull(result))
+        {
+            /*
+             * If the subplan gave us something then return it as-is. We do
+             * NOT make use of the result slot that was set up in
+             * ExecInitAppend; there's no need for it.
+             */
+            return result;
+        }
+
+        /*
+         * Go on to the "next" subplan. If no more subplans, return the empty
+         * slot set up for us by ExecInitAppend, unless there are async plans
+         * we have yet to finish.
+         */
+        if (!node->choose_next_subplan(node))
+        {
+            node->as_syncdone = true;
+            if (bms_is_empty(node->as_pending_async))
+            {
+                Assert(bms_is_empty(node->as_needrequest));
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+            }
+        }
+
+        /* Else loop back and try to get a tuple from the new subplan */
+    }
+}
+
 /* ----------------------------------------------------------------
  *        ExecEndAppend
  *
@@ -353,6 +557,15 @@ ExecReScanAppend(AppendState *node)
         node->as_valid_subplans = NULL;
     }
 
+    /* Reset async state. */
+    for (i = 0; i < node->as_nasyncplans; ++i)
+    {
+        ExecShutdownNode(node->appendplans[i]);
+        node->as_needrequest = bms_add_member(node->as_needrequest, i);
+    }
+    node->as_nasyncresult = 0;
+    node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
     for (i = 0; i < node->as_nplans; i++)
     {
         PlanState  *subnode = node->appendplans[i];
@@ -373,7 +586,7 @@ ExecReScanAppend(AppendState *node)
     }
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    node->as_whichplan = INVALID_SUBPLAN_INDEX;
+    node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 }
 
 /* ----------------------------------------------------------------
@@ -461,7 +674,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-    int            whichplan = node->as_whichplan;
+    int            whichplan = node->as_whichsyncplan;
     int            nextplan;
 
     /* We should never be called when there are no subplans */
@@ -480,6 +693,10 @@ choose_next_subplan_locally(AppendState *node)
             node->as_valid_subplans =
                 ExecFindMatchingSubPlans(node->as_prune_state);
 
+        /* Exclude async plans */
+        if (node->as_nasyncplans > 0)
+            node->as_valid_subplans =
+                bms_del_range(node->as_valid_subplans, 0,
+                              node->as_nasyncplans - 1);
+
         whichplan = -1;
     }
 
@@ -494,7 +711,7 @@ choose_next_subplan_locally(AppendState *node)
     if (nextplan < 0)
         return false;
 
-    node->as_whichplan = nextplan;
+    node->as_whichsyncplan = nextplan;
 
     return true;
 }
@@ -516,19 +733,19 @@ choose_next_subplan_for_leader(AppendState *node)
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
     /* We should never be called when there are no subplans */
-    Assert(node->as_whichplan != NO_MATCHING_SUBPLANS);
+    Assert(node->as_whichsyncplan != NO_MATCHING_SUBPLANS);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
     {
         /* Mark just-completed subplan as finished. */
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
     }
     else
     {
         /* Start with last subplan. */
-        node->as_whichplan = node->as_nplans - 1;
+        node->as_whichsyncplan = node->as_nplans - 1;
 
         /*
          * If we've yet to determine the valid subplans for these parameters
@@ -549,12 +766,12 @@ choose_next_subplan_for_leader(AppendState *node)
     }
 
     /* Loop until we find a subplan to execute. */
-    while (pstate->pa_finished[node->as_whichplan])
+    while (pstate->pa_finished[node->as_whichsyncplan])
     {
-        if (node->as_whichplan == 0)
+        if (node->as_whichsyncplan == 0)
         {
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-            node->as_whichplan = INVALID_SUBPLAN_INDEX;
+            node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
             LWLockRelease(&pstate->pa_lock);
             return false;
         }
@@ -563,12 +780,12 @@ choose_next_subplan_for_leader(AppendState *node)
          * We needn't pay attention to as_valid_subplans here as all invalid
          * plans have been marked as finished.
          */
-        node->as_whichplan--;
+        node->as_whichsyncplan--;
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -597,13 +814,13 @@ choose_next_subplan_for_worker(AppendState *node)
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
     /* We should never be called when there are no subplans */
-    Assert(node->as_whichplan != NO_MATCHING_SUBPLANS);
+    Assert(node->as_whichsyncplan != NO_MATCHING_SUBPLANS);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
     /* Mark just-completed subplan as finished. */
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     /*
      * If we've yet to determine the valid subplans for these parameters then
@@ -625,7 +842,7 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Save the plan from which we are starting the search. */
-    node->as_whichplan = pstate->pa_next_plan;
+    node->as_whichsyncplan = pstate->pa_next_plan;
 
     /* Loop until we find a valid subplan to execute. */
     while (pstate->pa_finished[pstate->pa_next_plan])
@@ -639,7 +856,7 @@ choose_next_subplan_for_worker(AppendState *node)
             /* Advance to the next valid plan. */
             pstate->pa_next_plan = nextplan;
         }
-        else if (node->as_whichplan > node->as_first_partial_plan)
+        else if (node->as_whichsyncplan > node->as_first_partial_plan)
         {
             /*
              * Try looping back to the first valid partial plan, if there is
@@ -648,7 +865,7 @@ choose_next_subplan_for_worker(AppendState *node)
             nextplan = bms_next_member(node->as_valid_subplans,
                                        node->as_first_partial_plan - 1);
             pstate->pa_next_plan =
-                nextplan < 0 ? node->as_whichplan : nextplan;
+                nextplan < 0 ? node->as_whichsyncplan : nextplan;
         }
         else
         {
@@ -656,10 +873,10 @@ choose_next_subplan_for_worker(AppendState *node)
              * At last plan, and either there are no partial plans or we've
              * tried them all.  Arrange to bail out.
              */
-            pstate->pa_next_plan = node->as_whichplan;
+            pstate->pa_next_plan = node->as_whichsyncplan;
         }
 
-        if (pstate->pa_next_plan == node->as_whichplan)
+        if (pstate->pa_next_plan == node->as_whichsyncplan)
         {
             /* We've tried everything! */
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -669,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Pick the plan we found, and advance pa_next_plan one more time. */
-    node->as_whichplan = pstate->pa_next_plan;
+    node->as_whichsyncplan = pstate->pa_next_plan;
     pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
                                            pstate->pa_next_plan);
 
@@ -696,8 +913,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index a2a28b7ec2..915deb7080 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate)
                     (ExecScanRecheckMtd) ForeignRecheck);
 }
 
-
 /* ----------------------------------------------------------------
  *        ExecInitForeignScan
  * ----------------------------------------------------------------
@@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
     scanstate->ss.ps.plan = (Plan *) node;
     scanstate->ss.ps.state = estate;
     scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+    scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+    if ((eflags & EXEC_FLAG_ASYNC) != 0)
+        scanstate->fs_async = true;
 
     /*
      * Miscellaneous initialization
@@ -387,3 +390,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
     if (fdwroutine->ShutdownForeignScan)
         fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *        ExecAsyncForeignScanConfigureWait
+ *
+ *        In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                              void *caller_data, bool reinit)
+{
+    FdwRoutine *fdwroutine = node->fdwroutine;
+
+    Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+    return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+                                                 caller_data, reinit);
+}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 9bf9a29d6b..b2ab879d49 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -922,6 +922,78 @@ bms_add_range(Bitmapset *a, int lower, int upper)
     return a;
 }
 
+/*
+ * bms_del_range
+ *        Delete members in the range of 'lower' to 'upper' from the set.
+ *
+ * Note this could also be done by calling bms_del_member in a loop; however,
+ * this function is faster when the range is large, as we work at the
+ * bitmapword level rather than bit by bit.
+ */
+Bitmapset *
+bms_del_range(Bitmapset *a, int lower, int upper)
+{
+    int            lwordnum,
+                lbitnum,
+                uwordnum,
+                ushiftbits,
+                wordnum;
+
+    if (lower < 0 || upper < 0)
+        elog(ERROR, "negative bitmapset member not allowed");
+    if (lower > upper)
+        elog(ERROR, "lower range must not be above upper range");
+    uwordnum = WORDNUM(upper);
+
+    if (a == NULL)
+    {
+        a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1));
+        a->nwords = uwordnum + 1;
+    }
+
+    /* ensure we have enough words to store the upper bit */
+    else if (uwordnum >= a->nwords)
+    {
+        int            oldnwords = a->nwords;
+        int            i;
+
+        a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1));
+        a->nwords = uwordnum + 1;
+        /* zero out the enlarged portion */
+        for (i = oldnwords; i < a->nwords; i++)
+            a->words[i] = 0;
+    }
+
+    wordnum = lwordnum = WORDNUM(lower);
+
+    lbitnum = BITNUM(lower);
+    ushiftbits = BITNUM(upper) + 1;
+
+    /*
+     * In the special case where lwordnum is the same as uwordnum, we must
+     * perform both the upper and lower masking on the same word.
+     */
+    if (lwordnum == uwordnum)
+    {
+        a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1)
+                                | (~(bitmapword) 0) << ushiftbits);
+    }
+    else
+    {
+        /* turn off lbitnum and all bits left of it */
+        a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1);
+
+        /* turn off all bits for any intermediate words */
+        while (wordnum < uwordnum)
+            a->words[wordnum++] = (bitmapword) 0;
+
+        /* turn off upper's bit and all bits right of it. */
+        a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits;
+    }
+
+    return a;
+}
+
 /*
  * bms_int_members - like bms_intersect, but left input is recycled
  */
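
Intended usage of the new bms_del_range(), as a quick illustration (mine, not
part of the patch):

    Bitmapset  *s;

    s = bms_add_range(NULL, 0, 9);  /* s = (b 0 1 2 3 4 5 6 7 8 9) */
    s = bms_del_range(s, 3, 5);     /* s = (b 0 1 2 6 7 8 9) */
    s = bms_del_range(s, 8, 20);    /* s = (b 0 1 2 6 7); the set is enlarged
                                     * first, then bits 8..20 are cleared */
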
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7c045a7afe..8304dd5b17 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -246,6 +246,8 @@ _copyAppend(const Append *from)
     COPY_NODE_FIELD(appendplans);
     COPY_SCALAR_FIELD(first_partial_plan);
     COPY_NODE_FIELD(part_prune_infos);
+    COPY_SCALAR_FIELD(nasyncplans);
+    COPY_SCALAR_FIELD(referent);
 
     return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1da9d7ed15..ed655f4ccb 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -403,6 +403,8 @@ _outAppend(StringInfo str, const Append *node)
     WRITE_NODE_FIELD(appendplans);
     WRITE_INT_FIELD(first_partial_plan);
     WRITE_NODE_FIELD(part_prune_infos);
+    WRITE_INT_FIELD(nasyncplans);
+    WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 2826cec2f8..fb4ae251de 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1652,6 +1652,8 @@ _readAppend(void)
     READ_NODE_FIELD(appendplans);
     READ_INT_FIELD(first_partial_plan);
     READ_NODE_FIELD(part_prune_infos);
+    READ_INT_FIELD(nasyncplans);
+    READ_INT_FIELD(referent);
 
     READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0317763f43..eda3420d02 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -211,7 +211,9 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
                    Index scanrelid, int wtParam);
 static Append *make_append(List *appendplans, int first_partial_plan,
-            List *tlist, List *partitioned_rels, List *partpruneinfos);
+                           int nasyncplans, int referent,
+                           List *tlist,
+                           List *partitioned_rels, List *partpruneinfos);
 static RecursiveUnion *make_recursive_union(List *tlist,
                      Plan *lefttree,
                      Plan *righttree,
@@ -294,6 +296,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
                  List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
                          GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -1036,10 +1039,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
     Append       *plan;
     List       *tlist = build_path_tlist(root, &best_path->path);
-    List       *subplans = NIL;
+    List       *asyncplans = NIL;
+    List       *syncplans = NIL;
     ListCell   *subpaths;
     RelOptInfo *rel = best_path->path.parent;
     List       *partpruneinfos = NIL;
+    int            nasyncplans = 0;
+    bool        first = true;
+    bool        referent_is_sync = true;
 
     /*
      * The subpaths list could be empty, if every child was proven empty by
@@ -1074,7 +1081,22 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
         /* Must insist that all children return the same tlist */
         subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-        subplans = lappend(subplans, subplan);
+        /*
+         * Classify as async-capable or not. If we have decided to run the
+         * children in parallel, we cannot run any of them asynchronously.
+         */
+        if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+        {
+            subplan->async_capable = true;
+            asyncplans = lappend(asyncplans, subplan);
+            ++nasyncplans;
+            if (first)
+                referent_is_sync = false;
+        }
+        else
+            syncplans = lappend(syncplans, subplan);
+
+        first = false;
     }
 
     if (enable_partition_pruning &&
@@ -1117,9 +1139,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
      * parent-rel Vars it'll be asked to emit.
      */
 
-    plan = make_append(subplans, best_path->first_partial_path,
-                       tlist, best_path->partitioned_rels,
-                       partpruneinfos);
+    plan = make_append(list_concat(asyncplans, syncplans),
+                       best_path->first_partial_path, nasyncplans,
+                       referent_is_sync ? nasyncplans : 0, tlist,
+                       best_path->partitioned_rels, partpruneinfos);

     copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5414,9 +5437,9 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int first_partial_plan,
-            List *tlist, List *partitioned_rels,
-            List *partpruneinfos)
+make_append(List *appendplans, int first_partial_plan, int nasyncplans,
+            int referent, List *tlist,
+            List *partitioned_rels, List *partpruneinfos)
 {
     Append       *node = makeNode(Append);
     Plan       *plan = &node->plan;
@@ -5429,6 +5452,9 @@ make_append(List *appendplans, int first_partial_plan,
     node->appendplans = appendplans;
     node->first_partial_plan = first_partial_plan;
     node->part_prune_infos = partpruneinfos;
+    node->nasyncplans = nasyncplans;
+    node->referent = referent;
+
     return node;
 }
 
@@ -6773,3 +6799,27 @@ is_projection_capable_plan(Plan *plan)
     }
     return true;
 }
+
+/*
+ * is_async_capable_path
+ *        Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+    switch (nodeTag(path))
+    {
+        case T_ForeignPath:
+            {
+                FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+                Assert(fdwroutine != NULL);
+                if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+                    fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+                    return true;
+            }
+        default:
+            break;
+    }
+    return false;
+}
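
To spell out the reordering this produces, a worked example (mine, not from
the patch): with subpaths [c1, c2, c3] where only c2 is async-capable and the
path is not parallel-safe,

    asyncplans  = [c2], syncplans = [c1, c3]
    appendplans = list_concat(asyncplans, syncplans) = [c2, c1, c3]
    nasyncplans = 1
    referent    = 1     /* c1, the original first child, now at index 1 */

so that the ruleutils.c change below can still resolve OUTER Vars against the
original first child.
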
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 084573e77c..7aef97ca97 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3683,6 +3683,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_SYNC_REP:
             event_name = "SyncRep";
             break;
+        case WAIT_EVENT_ASYNC_WAIT:
+            event_name = "AsyncExecWait";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 065238b0fe..fe202cbfea 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4513,7 +4513,7 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
     dpns->planstate = ps;
 
     /*
-     * We special-case Append and MergeAppend to pretend that the first child
+     * We special-case Append and MergeAppend to pretend that a specific child
      * plan is the OUTER referent; we have to interpret OUTER Vars in their
      * tlists according to one of the children, and the first one is the most
      * natural choice.  Likewise special-case ModifyTable to pretend that the
@@ -4521,7 +4521,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
      * lists containing references to non-target relations.
      */
     if (IsA(ps, AppendState))
-        dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+    {
+        AppendState *aps = (AppendState *) ps;
+        Append *app = (Append *) ps->plan;
+        dpns->outer_planstate = aps->appendplans[app->referent];
+    }
     else if (IsA(ps, MergeAppendState))
         dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
     else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..5fd67d9004
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *        Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *        src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern void ExecAsyncSetState(PlanState *pstate, AsyncState status);
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                                   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+                                     long timeout);
+#endif   /* EXECASYNC_H */
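
Since execAsync.c itself appears elsewhere in the patch, here is a hedged
sketch of how a requestor is meant to drive these entry points (the function
name, the variable names and the -1 = block-indefinitely timeout are my
assumptions):

    /* sketch: drain all pending async children of an Append-like node */
    static void
    drain_async_children(AppendState *astate, Bitmapset *pending)
    {
        while (!bms_is_empty(pending))
        {
            /* block until at least one node in "pending" becomes ready */
            Bitmapset  *fired = ExecAsyncEventWait(astate->appendplans,
                                                   pending, -1);
            int         i;

            while ((i = bms_first_member(fired)) >= 0)
            {
                pending = bms_del_member(pending, i);
                /* re-drive child i; it reports back via ExecAsyncSetState() */
            }
        }
    }
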
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index a7ea3c7d10..8e9d87669f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -63,6 +63,7 @@
 #define EXEC_FLAG_WITH_OIDS        0x0020    /* force OIDs in returned tuples */
 #define EXEC_FLAG_WITHOUT_OIDS    0x0040    /* force no OIDs in returned tuples */
 #define EXEC_FLAG_WITH_NO_DATA    0x0080    /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC            0x0100    /* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index ccb66be733..67abf8e52e 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
                                 ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+                                          WaitEventSet *wes,
+                                          void *caller_data, bool reinit);
 
 #endif                            /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index c14eb546c6..c00e9621fb 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -168,6 +168,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
                                                             List *fdw_private,
                                                             RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+                                                    WaitEventSet *wes,
+                                                    void *caller_data,
+                                                    bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -189,6 +194,7 @@ typedef struct FdwRoutine
     GetForeignPlan_function GetForeignPlan;
     BeginForeignScan_function BeginForeignScan;
     IterateForeignScan_function IterateForeignScan;
+    IterateForeignScan_function IterateForeignScanAsync;
     ReScanForeignScan_function ReScanForeignScan;
     EndForeignScan_function EndForeignScan;
 
@@ -241,6 +247,11 @@ typedef struct FdwRoutine
     InitializeDSMForeignScan_function InitializeDSMForeignScan;
     ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
     InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+    /* Support functions for asynchronous execution */
+    IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+    ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
     ShutdownForeignScan_function ShutdownForeignScan;
 
     /* Support functions for path reparameterization. */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index b6f1a9e6e5..41f0927934 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -94,6 +94,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
+extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
 
 /* support for iterating through the integer elements of a set: */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index da7f52cab0..56bfe3f442 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -905,6 +905,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+    AS_AVAILABLE,
+    AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
     NodeTag        type;
@@ -953,6 +959,9 @@ typedef struct PlanState
      * descriptor, without encoding knowledge about all executor nodes.
      */
     TupleDesc    scandesc;
+
+    AsyncState    asyncstate;
+    int32        padding;            /* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1087,14 +1096,20 @@ struct AppendState
     PlanState    ps;                /* its first field is NodeTag */
     PlanState **appendplans;    /* array of PlanStates for my inputs */
     int            as_nplans;
-    int            as_whichplan;
+    int            as_whichsyncplan; /* which sync plan is being executed  */
     int            as_first_partial_plan;    /* Index of 'appendplans' containing
                                          * the first partial plan */
+    int            as_nasyncplans;    /* # of async-capable children */
     ParallelAppendState *as_pstate; /* parallel coordination info */
     Size        pstate_len;        /* size of parallel coordination info */
     struct PartitionPruneState *as_prune_state;
     Bitmapset  *as_valid_subplans;
     bool        (*choose_next_subplan) (AppendState *);
+    bool        as_syncdone;    /* all synchronous plans done? */
+    Bitmapset  *as_needrequest;    /* async plans needing a new request */
+    Bitmapset  *as_pending_async;    /* pending async plans */
+    TupleTableSlot **as_asyncresult;    /* unreturned results of async plans */
+    int            as_nasyncresult;    /* # of valid entries in as_asyncresult */
 };
 
 /* ----------------
@@ -1643,6 +1658,7 @@ typedef struct ForeignScanState
     Size        pscan_len;        /* size of parallel coordination information */
     /* use struct pointer to avoid including fdwapi.h here */
     struct FdwRoutine *fdwroutine;
+    bool        fs_async;        /* true to run asynchronously */
     void       *fdw_state;        /* foreign-data wrapper can keep state here */
 } ForeignScanState;
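
Reading the new AppendState fields together, the intended lifecycle is roughly
as follows (the nodeAppend.c side is elsewhere in the patch; this summary is
illustrative only):

    1. ExecAppend sends a request to every child listed in as_needrequest.
    2. A child that cannot return a tuple yet moves to as_pending_async and
       its PlanState.asyncstate becomes AS_WAITING.
    3. When a pending child fires, its tuple is stashed in
       as_asyncresult[0 .. as_nasyncresult - 1] and returned before any new
       request is issued.
    4. Once as_syncdone is set and both bitmapsets are empty, the Append is
       done.
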
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f2dda82e66..8a64c037c9 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -139,6 +139,11 @@ typedef struct Plan
     bool        parallel_aware; /* engage parallel-aware logic? */
     bool        parallel_safe;    /* OK to use as part of parallel plan? */
 
+    /*
+     * information needed for asynchronous execution
+     */
+    bool        async_capable;  /* engage asynchronous execution logic? */
+
     /*
      * Common structural data for all Plan types.
      */
@@ -262,6 +267,8 @@ typedef struct Append
      * Mapping details for run-time subplan pruning, one per partitioned_rels
      */
     List       *part_prune_infos;
+    int            nasyncplans;    /* # of async plans, always at start of list */
+    int            referent;         /* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f59239b..6f4583b46c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -832,7 +832,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_ORIGIN_DROP,
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
+    WAIT_EVENT_SYNC_REP,
+    WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.16.3

From 72120b5c2b0775d33186dec7d4fc206e63094c20 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c              |  26 +
 contrib/postgres_fdw/expected/postgres_fdw.out | 198 ++++----
 contrib/postgres_fdw/postgres_fdw.c            | 633 ++++++++++++++++++++++---
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 708 insertions(+), 171 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index fe4893a8e0..da7c826e4f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
     bool        invalidated;    /* true if reconnect is pending */
     uint32        server_hashvalue;    /* hash value of foreign server OID */
     uint32        mapping_hashvalue;    /* hash value of user mapping OID */
+    void        *storage;        /* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
         elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
              entry->conn, server->servername, user->umid, user->userid);
+        entry->storage = NULL;
     }
 
     /*
@@ -215,6 +217,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
     return entry->conn;
 }
 
+/*
+ * Returns the connection-specific storage for this user.  Allocates it,
+ * zero-filled, with initsize bytes if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+    bool        found;
+    ConnCacheEntry *entry;
+    ConnCacheKey key;
+
+    key = user->umid;
+    entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+    Assert(found);
+
+    if (entry->storage == NULL)
+    {
+        entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+        memset(entry->storage, 0, initsize);
+    }
+
+    return entry->storage;
+}
+
 /*
  * Connect to remote server using specified server and user mapping properties.
  */
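
The postgres_fdw side (further down in this patch) uses this to share
per-connection scheduling state among all scans on one connection:

    /* from postgresBeginForeignScan, below */
    fsstate->s.conn = GetConnection(user, false);
    fsstate->s.connpriv = (PgFdwConnpriv *)
        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
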
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index bb6b1a8fdf..cddc207c04 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6793,7 +6793,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -6821,7 +6821,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6849,7 +6849,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6877,7 +6877,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -6947,35 +6947,41 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6985,35 +6991,41 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -7043,11 +7055,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -7061,12 +7074,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(41 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
 select tableoid::regclass, * from bar order by 1,2;
@@ -7096,16 +7110,17 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
-               ->  Foreign Scan on public.foo2
+               Async subplans: 2 
+               ->  Async Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-               ->  Foreign Scan on public.foo2 foo2_1
+               ->  Async Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -7123,17 +7138,18 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
-                     ->  Foreign Scan on public.foo2
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-                     ->  Foreign Scan on public.foo2 foo2_1
+                     ->  Async Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
+(47 rows)
 
 update bar set f2 = f2 + 100
 from
@@ -7283,27 +7299,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
@@ -8155,11 +8177,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
  Sort
    Sort Key: t1.a, t3.c
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: ((public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)) INNER JOIN (public.ftprt1_p1 t3)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: ((public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)) INNER JOIN (public.ftprt1_p2 t3)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE
t1.a% 25 =0 ORDER BY 1,2,3;
 
   a  |  b  |  c   
@@ -8178,9 +8201,10 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
  Sort
    Sort Key: t1.a, ftprt2_p1.b, ftprt2_p1.c
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 1 
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p1 t1) LEFT JOIN (public.ftprt2_p1 fprt2)
-(5 rows)
+(6 rows)
 
 SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) t2 ON (t1.a = t2.b and t1.b = t2.a)
WHEREt1.a < 10 ORDER BY 1,2,3;
 
  a | b |  c   
@@ -8200,11 +8224,12 @@ SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE
  Sort
    Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)
-(7 rows)
+(8 rows)
 
 SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a % 25 =0 ORDER BY 1,2;
        t1       |       t2       
@@ -8223,11 +8248,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
  Sort
    Sort Key: t1.a, t1.b
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE
t1.a%25= 0 ORDER BY 1,2;
 
   a  |  b  
@@ -8309,10 +8335,11 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
          Group Key: fpagg_tab_p1.a
          Filter: (avg(fpagg_tab_p1.b) < '22'::numeric)
          ->  Append
-               ->  Foreign Scan on fpagg_tab_p1
-               ->  Foreign Scan on fpagg_tab_p2
-               ->  Foreign Scan on fpagg_tab_p3
-(9 rows)
+               Async subplans: 3 
+               ->  Async Foreign Scan on fpagg_tab_p1
+               ->  Async Foreign Scan on fpagg_tab_p2
+               ->  Async Foreign Scan on fpagg_tab_p3
+(10 rows)
 
 -- Plan with partitionwise aggregates is enabled
 SET enable_partitionwise_aggregate TO true;
@@ -8323,13 +8350,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
  Sort
    Sort Key: fpagg_tab_p1.a
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 3 
+         ->  Async Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p1 pagg_tab)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p2 pagg_tab)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p3 pagg_tab)
-(9 rows)
+(10 rows)
 
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | sum  | min | count 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 78b0f43ca8..51d19cc421 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -119,11 +125,28 @@ enum FdwDirectModifyPrivateIndex
     FdwDirectModifyPrivateSetProcessed
 };
 
+/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+    ForeignScanState   *leader;        /* leader node of this connection */
+    bool                busy;        /* true if this connection is busy */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+    PGconn       *conn;            /* connection for the scan */
+    PgFdwConnpriv *connpriv;    /* connection private memory */
+} PgFdwState;
+
 /*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table. NULL
                                  * for a foreign join scan. */
     TupleDesc    tupdesc;        /* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
     List       *retrieved_attrs;    /* list of retrieved attribute numbers */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
+    bool        result_ready;
     unsigned int cursor_number; /* quasi-unique ID for my cursor */
     bool        cursor_exists;    /* have we created the cursor? */
     int            numParams;        /* number of parameters passed to query */
@@ -150,6 +173,12 @@ typedef struct PgFdwScanState
     /* batch-level state, for optimizing rewinds and avoiding useless fetch */
     int            fetch_ct_2;        /* Min(# of fetches done, 2) */
     bool        eof_reached;    /* true if last fetch reached EOF */
+    bool        run_async;        /* true if run asynchronously */
+    bool        inqueue;        /* true if this node is in waiter queue */
+    ForeignScanState *waiter;    /* Next node to run a query among nodes
+                                 * sharing the same connection */
+    ForeignScanState *last_waiter;    /* last waiting node in waiting queue.
+                                     * valid only on the leader node */
 
     /* working memory contexts */
     MemoryContext batch_cxt;    /* context holding current batch of tuples */
@@ -163,11 +192,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     char       *p_name;            /* name of prepared statement, if created */
 
     /* extracted fdw_private data */
@@ -190,6 +219,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
@@ -293,6 +323,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
                                 RangeTblEntry *target_rte,
                                 Relation target_relation);
@@ -358,6 +389,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
                              RelOptInfo *input_rel,
                              RelOptInfo *output_rel,
                              void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+                                              WaitEventSet *wes,
+                                              void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -378,7 +413,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
                           EquivalenceClass *ec, EquivalenceMember *em,
                           void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static PgFdwModifyState *create_foreign_modify(EState *estate,
                       RangeTblEntry *rte,
@@ -469,6 +506,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     routine->IterateForeignScan = postgresIterateForeignScan;
     routine->ReScanForeignScan = postgresReScanForeignScan;
     routine->EndForeignScan = postgresEndForeignScan;
+    routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
     /* Functions for updating foreign tables */
     routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -505,6 +543,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     /* Support functions for upper relation push-down */
     routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+    /* Support functions for async execution */
+    routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+    routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
     PG_RETURN_POINTER(routine);
 }
 
@@ -1355,12 +1397,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    fsstate->conn = GetConnection(user, false);
+    fsstate->s.conn = GetConnection(user, false);
+    fsstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+    fsstate->s.connpriv->leader = NULL;
+    fsstate->s.connpriv->busy = false;
+    fsstate->waiter = NULL;
+    fsstate->last_waiter = node;
 
     /* Assign a unique ID for my cursor */
-    fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+    fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
     fsstate->cursor_exists = false;
 
+    /* Initialize async execution status */
+    fsstate->run_async = false;
+    fsstate->inqueue = false;
+
     /* Get private info created by planner functions. */
     fsstate->query = strVal(list_nth(fsplan->fdw_private,
                                      FdwScanPrivateSelectSql));
@@ -1408,40 +1460,258 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
                              &fsstate->param_values);
 }
 
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Adds the node to the end of the waiter queue.  Immediately starts the node
+ * if no node is currently running.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+    PgFdwScanState   *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.connpriv->leader;
+
+    /* do nothing if the node is already in the queue or already eof'ed */
+    if (leader == node || fsstate->inqueue || fsstate->eof_reached)
+        return;
+
+    if (leader == NULL)
+    {
+        /* immediately send request if not busy */
+        request_more_data(node);
+    }
+    else
+    {
+        PgFdwScanState   *leader_state = GetPgFdwScanState(leader);
+        PgFdwScanState   *last_waiter_state
+            = GetPgFdwScanState(leader_state->last_waiter);
+
+        last_waiter_state->waiter = node;
+        leader_state->last_waiter = node;
+        fsstate->inqueue = true;
+    }
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Makes the first waiter the next leader.
+ * Returns the new leader, or NULL if there is no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *ret = fsstate->waiter;
+
+    Assert(fsstate->s.connpriv->leader == node);
+
+    if (ret)
+    {
+        PgFdwScanState *retstate = GetPgFdwScanState(ret);
+        fsstate->waiter = NULL;
+        retstate->last_waiter = fsstate->last_waiter;
+        retstate->inqueue = false;
+    }
+
+    fsstate->s.connpriv->leader = ret;
+
+    return ret;
+}
+
+/*
+ * remove_async_node:
+ *
+ * Removes the node from the waiter queue.  This differs from the two
+ * functions above in that it can also operate on the connection leader.
+ * Any in-flight result is absorbed when this is called on the active leader.
+ *
+ * Returns true if the node was found.
+ */
+static inline bool
+remove_async_node(ForeignScanState *node)
+{
+    PgFdwScanState        *fsstate = GetPgFdwScanState(node);
+    ForeignScanState    *leader = fsstate->s.connpriv->leader;
+    PgFdwScanState        *leader_state;
+    ForeignScanState    *prev;
+    PgFdwScanState        *prev_state;
+    ForeignScanState    *cur;
+
+    /* nothing to do if there's no leader, or I'm neither leader nor queued */
+    if (!leader || (node != leader && !fsstate->inqueue))
+        return false;
+
+    leader_state = GetPgFdwScanState(leader);
+
+    /* Remove the leader node */
+    if (leader == node)
+    {
+        ForeignScanState    *next_leader;
+
+        if (leader_state->s.connpriv->busy)
+        {
+            /*
+             * This node is waiting for a result; absorb it first so that
+             * subsequent commands can be sent on the connection.
+             */
+            PGconn *conn = leader_state->s.conn;
+
+            while (PQisBusy(conn))
+                PQclear(PQgetResult(conn));
+
+            leader_state->s.connpriv->busy = false;
+        }
+
+        /* Make the first waiter the leader */
+        if (leader_state->waiter)
+        {
+            PgFdwScanState *next_leader_state;
+
+            next_leader = leader_state->waiter;
+            next_leader_state = GetPgFdwScanState(next_leader);
+
+            leader_state->s.connpriv->leader = next_leader;
+            next_leader_state->last_waiter = leader_state->last_waiter;
+            next_leader_state->inqueue = false;
+        }
+        leader_state->waiter = NULL;
+
+        return true;
+    }
+
+    /*
+     * Just remove the node from the queue.
+     *
+     * This function is called on the shutdown path, so we don't bother
+     * finding a faster way to do this.
+     */
+    prev = leader;
+    prev_state = leader_state;
+    cur = GetPgFdwScanState(prev)->waiter;
+    while (cur)
+    {
+        PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+        if (cur == node)
+        {
+            prev_state->waiter = curstate->waiter;
+            /* if the removed node was the tail, the tail is now prev */
+            if (leader_state->last_waiter == cur)
+                leader_state->last_waiter = prev;
+
+            fsstate->inqueue = false;
+
+            return true;
+        }
+        prev = cur;
+        prev_state = curstate;
+        cur = curstate->waiter;
+    }
+
+    return false;
+}
+
 /*
  * postgresIterateForeignScan
- *        Retrieve next row from the result set, or clear tuple slot to indicate
- *        EOF.
+ *        Retrieve next row from the result set.
+ *
+ *        For synchronous nodes, returns a cleared tuple slot to indicate EOF.
+ *
+ *        If the node is an asynchronous one, a cleared tuple slot is
+ *        ambiguous: asyncstate tells the caller whether the node has reached
+ *        EOF (AS_AVAILABLE) or is still waiting for data (AS_WAITING).
  */
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
-    /*
-     * If this is the first call after Begin or ReScan, we need to create the
-     * cursor on the remote side.
-     */
-    if (!fsstate->cursor_exists)
-        create_cursor(node);
+    if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+    {
+        /* we've run out, get some more tuples */
+        if (!node->fs_async)
+        {
+            /* finish running query to send my command */
+            if (!fsstate->s.connpriv->busy)
+                vacate_connection((PgFdwState *)fsstate, false);
+
+            request_more_data(node);
+
+            /*
+             * Fetch the result immediately. This executes the next waiter if
+             * any.
+             */
+            fetch_received_data(node);
+        }
+        else if (!fsstate->s.connpriv->busy)
+        {
+            /* If the connection is not busy, just send the request. */
+            request_more_data(node);
+        }
+        else
+        {
+            /* This connection is busy */
+            bool available = true;
+            ForeignScanState *leader = fsstate->s.connpriv->leader;
+            PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+
+            /* Check if the result is immediately available */
+            if (PQisBusy(leader_state->s.conn))
+            {
+                int rc = WaitLatchOrSocket(NULL,
+                                           WL_SOCKET_READABLE | WL_TIMEOUT,
+                                           PQsocket(leader_state->s.conn), 0,
+                                           WAIT_EVENT_ASYNC_WAIT);
+                if (!(rc & WL_SOCKET_READABLE))
+                    available = false;
+            }
+
+            /* The next waiter is executed automatically */
+            if (available)
+                fetch_received_data(leader);
+
+            /* add the requested node */
+            add_async_waiter(node);
+
+            /* add the previous leader */
+            add_async_waiter(leader);
+        }
+    }
 
     /*
-     * Get some more tuples, if we've run out.
+     * If we haven't received a result for the given node this time,
+     * return with no tuple to give way to another node.
      */
     if (fsstate->next_tuple >= fsstate->num_tuples)
     {
-        /* No point in another fetch if we already detected EOF, though. */
-        if (!fsstate->eof_reached)
-            fetch_more_data(node);
-        /* If we didn't get any tuples, must be end of data. */
-        if (fsstate->next_tuple >= fsstate->num_tuples)
-            return ExecClearTuple(slot);
+        if (fsstate->eof_reached)
+        {
+            fsstate->result_ready = true;
+            node->ss.ps.asyncstate = AS_AVAILABLE;
+        }
+        else
+        {
+            fsstate->result_ready = false;
+            node->ss.ps.asyncstate = AS_WAITING;
+        }
+
+        return ExecClearTuple(slot);
     }
 
     /*
      * Return the next tuple.
      */
+    fsstate->result_ready = true;
+    node->ss.ps.asyncstate = AS_AVAILABLE;
     ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
                    slot,
                    InvalidBuffer,
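
To make the queue invariants above easier to verify, an illustrative state
(mine, not from the patch): scans A, B and C share one connection and A's
query is in flight:

    connpriv->leader = A,  connpriv->busy = true
    A->waiter = B,  B->waiter = C,  C->waiter = NULL
    A->last_waiter = C              /* meaningful only on the leader */
    B->inqueue = C->inqueue = true

    /* move_to_next_waiter(A) then promotes B: */
    connpriv->leader = B,  B->last_waiter = C,  B->inqueue = false
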
@@ -1457,7 +1727,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     char        sql[64];
     PGresult   *res;
 
@@ -1465,6 +1735,8 @@ postgresReScanForeignScan(ForeignScanState *node)
     if (!fsstate->cursor_exists)
         return;
 
+    vacate_connection((PgFdwState *)fsstate, true);
+
     /*
      * If any internal parameters affecting this node have changed, we'd
      * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1493,9 +1765,9 @@ postgresReScanForeignScan(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_exec_query(fsstate->conn, sql);
+    res = pgfdw_exec_query(fsstate->s.conn, sql);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+        pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
     PQclear(res);
 
     /* Now force a fresh FETCH. */
@@ -1513,7 +1785,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
     /* if fsstate is NULL, we are in EXPLAIN; nothing to do */
     if (fsstate == NULL)
@@ -1521,15 +1793,31 @@ postgresEndForeignScan(ForeignScanState *node)
 
     /* Close the cursor if open, to prevent accumulation of cursors */
     if (fsstate->cursor_exists)
-        close_cursor(fsstate->conn, fsstate->cursor_number);
+        close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
     /* Release remote connection */
-    ReleaseConnection(fsstate->conn);
-    fsstate->conn = NULL;
+    ReleaseConnection(fsstate->s.conn);
+    fsstate->s.conn = NULL;
 
     /* MemoryContexts will be deleted automatically. */
 }
 
+/*
+ * postgresShutdownForeignScan
+ *        Remove the node from the async waiter queue and clean up the
+ *        connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+    ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+    if (plan->operation != CMD_SELECT)
+        return;
+
+    /* remove the node from waiting queue */
+    remove_async_node(node);
+}
+
 /*
  * postgresAddForeignUpdateTargets
  *        Add resjunk column(s) needed for update/delete on a foreign table
@@ -1753,6 +2041,9 @@ postgresExecForeignInsert(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    /* finish running query to send my command */
+    vacate_connection((PgFdwState *)fmstate, true);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1763,14 +2054,14 @@ postgresExecForeignInsert(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1778,10 +2069,10 @@ postgresExecForeignInsert(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -1819,6 +2110,9 @@ postgresExecForeignUpdate(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    /* Finish any query in flight so that we can send our own command. */
+    vacate_connection((PgFdwState *) fmstate, true);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1839,14 +2133,14 @@ postgresExecForeignUpdate(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1854,10 +2148,10 @@ postgresExecForeignUpdate(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -1895,6 +2189,9 @@ postgresExecForeignDelete(EState *estate,
     PGresult   *res;
     int            n_rows;
 
+    /* Finish any query in flight so that we can send our own command. */
+    vacate_connection((PgFdwState *) fmstate, true);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -1915,14 +2212,14 @@ postgresExecForeignDelete(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -1930,10 +2227,10 @@ postgresExecForeignDelete(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -2400,7 +2697,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    dmstate->conn = GetConnection(user, false);
+    dmstate->s.conn = GetConnection(user, false);
+    dmstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
     /* Update the foreign-join-related fields. */
     if (fsplan->scan.scanrelid == 0)
@@ -2485,7 +2784,11 @@ postgresIterateDirectModify(ForeignScanState *node)
      * If this is the first call after Begin, execute the statement.
      */
     if (dmstate->num_tuples == -1)
+    {
+        /* Finish any query in flight so that we can send our own command. */
+        vacate_connection((PgFdwState *) dmstate, true);
         execute_dml_stmt(node);
+    }
 
     /*
      * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2532,8 +2835,8 @@ postgresEndDirectModify(ForeignScanState *node)
         PQclear(dmstate->result);
 
     /* Release remote connection */
-    ReleaseConnection(dmstate->conn);
-    dmstate->conn = NULL;
+    ReleaseConnection(dmstate->s.conn);
+    dmstate->s.conn = NULL;
 
     /* close the target relation. */
     if (dmstate->resultRel)
@@ -2656,6 +2959,7 @@ estimate_path_cost_size(PlannerInfo *root,
         List       *local_param_join_conds;
         StringInfoData sql;
         PGconn       *conn;
+        PgFdwConnpriv *connpriv;
         Selectivity local_sel;
         QualCost    local_cost;
         List       *fdw_scan_tlist = NIL;
@@ -2698,6 +3002,18 @@ estimate_path_cost_size(PlannerInfo *root,
 
         /* Get the remote estimate */
         conn = GetConnection(fpinfo->user, false);
+        connpriv = GetConnectionSpecificStorage(fpinfo->user,
+                                                sizeof(PgFdwConnpriv));
+        if (connpriv)
+        {
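+            /*
+             * vacate_connection() looks only at the conn and connpriv
+             * fields, so a transient PgFdwState on the stack suffices.
+             */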
+            PgFdwState tmpstate;
+            tmpstate.conn = conn;
+            tmpstate.connpriv = connpriv;
+
+            /* Finish any query in flight so that we can send our own command. */
+            vacate_connection(&tmpstate, true);
+        }
+
         get_remote_estimate(sql.data, conn, &rows, &width,
                             &startup_cost, &total_cost);
         ReleaseConnection(conn);
@@ -3061,11 +3377,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     ExprContext *econtext = node->ss.ps.ps_ExprContext;
     int            numParams = fsstate->numParams;
     const char **values = fsstate->param_values;
-    PGconn       *conn = fsstate->conn;
+    PGconn       *conn = fsstate->s.conn;
     StringInfoData buf;
     PGresult   *res;
 
@@ -3128,50 +3444,127 @@ create_cursor(ForeignScanState *node)
 }
 
 /*
- * Fetch some more rows from the node's cursor.
+ * Send the next fetch request for the given node.  If the node is not the
+ * current connection leader, push the current leader back onto the waiter
+ * queue and make the given node the new leader.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.connpriv->leader;
+    PGconn       *conn = fsstate->s.conn;
+    char        sql[64];
+
+    /* the connection must not be busy */
+    Assert(!fsstate->s.connpriv->busy);
+    /* and this node must not have reached EOF */
+    Assert(!fsstate->eof_reached);
+
+    /*
+     * If this is the first call after Begin or ReScan, we need to create the
+     * cursor on the remote side.
+     */
+    if (!fsstate->cursor_exists)
+        create_cursor(node);
+
+    snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+             fsstate->fetch_size, fsstate->cursor_number);
+
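+    /*
+     * The FETCH is sent asynchronously; its result is consumed later by
+     * fetch_received_data(), possibly on behalf of another node that needs
+     * this connection.
+     */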
+    if (!PQsendQuery(conn, sql))
+        pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+    fsstate->s.connpriv->busy = true;
+
+    /* Make this node the leader if it is not already the current one */
+    if (leader != node)
+    {
+        /*
+         * If a connection leader already exists, install this node as the
+         * new leader and make the previous leader the first waiter.
+         */
+        if (leader != NULL)
+        {
+            remove_async_node(node);
+            fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+            fsstate->waiter = leader;
+        }
+        else
+        {
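+            /* The queue was empty: this node is both the leader and the tail. */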
+            fsstate->last_waiter = node;
+            fsstate->waiter = NULL;
+        }
+
+        fsstate->s.connpriv->leader = node;
+    }
+}
+
+/*
+ * Fetch the data received for this node, then automatically send the next
+ * waiter's request, if any.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     PGresult   *volatile res = NULL;
     MemoryContext oldcontext;
+    ForeignScanState *waiter;
+
+    /* This node must be the current connection leader */
+    Assert(fsstate->s.connpriv->leader == node);
 
     /*
      * We'll store the tuples in the batch_cxt.  First, flush the previous
-     * batch.
+     * batch if no tuples remain in it.
      */
-    fsstate->tuples = NULL;
-    MemoryContextReset(fsstate->batch_cxt);
+    if (fsstate->next_tuple >= fsstate->num_tuples)
+    {
+        fsstate->tuples = NULL;
+        fsstate->num_tuples = 0;
+        MemoryContextReset(fsstate->batch_cxt);
+    }
+    else if (fsstate->next_tuple > 0)
+    {
+        /* move the remaining tuples to the beginning of the store */
+        int n = 0;
+
+        while (fsstate->next_tuple < fsstate->num_tuples)
+            fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+        fsstate->num_tuples = n;
+    }
+
     oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
     /* PGresult must be released before leaving this function. */
     PG_TRY();
     {
-        PGconn       *conn = fsstate->conn;
+        PGconn       *conn = fsstate->s.conn;
         char        sql[64];
-        int            numrows;
+        int            addrows;
+        size_t        newsize;
         int            i;
 
         snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
                  fsstate->fetch_size, fsstate->cursor_number);
 
-        res = pgfdw_exec_query(conn, sql);
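+        /* Collect the result of the FETCH sent by request_more_data(). */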
+        res = pgfdw_get_result(conn, sql);
         /* On error, report the original query, not the FETCH. */
         if (PQresultStatus(res) != PGRES_TUPLES_OK)
             pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
         /* Convert the data into HeapTuples */
-        numrows = PQntuples(res);
-        fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-        fsstate->num_tuples = numrows;
-        fsstate->next_tuple = 0;
+        addrows = PQntuples(res);
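+        /* Enlarge the tuple array to hold both retained and new tuples. */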
+        newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+        if (fsstate->tuples)
+            fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+        else
+            fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-        for (i = 0; i < numrows; i++)
+        for (i = 0; i < addrows; i++)
         {
             Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-            fsstate->tuples[i] =
+            fsstate->tuples[fsstate->num_tuples + i] =
                 make_tuple_from_result_row(res, i,
                                            fsstate->rel,
                                            fsstate->attinmeta,
@@ -3181,26 +3574,76 @@ fetch_more_data(ForeignScanState *node)
         }
 
         /* Update fetch_ct_2 */
-        if (fsstate->fetch_ct_2 < 2)
+        if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
             fsstate->fetch_ct_2++;
 
+        fsstate->next_tuple = 0;
+        fsstate->num_tuples += addrows;
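+        /* Consumption restarts at the head of the (possibly extended) array. */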
+
         /* Must be EOF if we didn't get as many tuples as we asked for. */
-        fsstate->eof_reached = (numrows < fsstate->fetch_size);
+        fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
         PQclear(res);
         res = NULL;
     }
     PG_CATCH();
     {
+        fsstate->s.connpriv->busy = false;
+
         if (res)
             PQclear(res);
         PG_RE_THROW();
     }
     PG_END_TRY();
 
+    fsstate->s.connpriv->busy = false;
+
+    /* let the first waiter be the next leader of this connection */
+    waiter = move_to_next_waiter(node);
+
+    /* send the next request if any */
+    if (waiter)
+        request_more_data(waiter);
+
     MemoryContextSwitchTo(oldcontext);
 }
 
+/*
+ * Vacate the connection so that this node can send its next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+    PgFdwConnpriv *connpriv = fdwstate->connpriv;
+    ForeignScanState *leader;
+
+    /* the connection is already available */
+    if (connpriv == NULL || connpriv->leader == NULL || !connpriv->busy)
+        return;
+
+    /*
+     * Let the current connection leader read the result of the running query.
+     */
+    leader = connpriv->leader;
+    fetch_received_data(leader);
+
+    /* let the first waiter be the next leader of this connection */
+    move_to_next_waiter(leader);
+
+    if (!clear_queue)
+        return;
+
+    /* Clear the waiting list */
+    while (leader)
+    {
+        PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
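+        /* Detach this node from the queue and step to the next one. */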
+        fsstate->last_waiter = NULL;
+        leader = fsstate->waiter;
+        fsstate->waiter = NULL;
+    }
+}
+
 /*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
@@ -3314,7 +3757,9 @@ create_foreign_modify(EState *estate,
     user = GetUserMapping(userid, table->serverid);
 
     /* Open connection; report that we'll create a prepared statement. */
-    fmstate->conn = GetConnection(user, true);
+    fmstate->s.conn = GetConnection(user, true);
+    fmstate->s.connpriv = (PgFdwConnpriv *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
     fmstate->p_name = NULL;        /* prepared statement not made yet */
 
     /* Set up remote query information. */
@@ -3387,7 +3832,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
     /* Construct name we'll use for the prepared statement. */
     snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-             GetPrepStmtNumber(fmstate->conn));
+             GetPrepStmtNumber(fmstate->s.conn));
     p_name = pstrdup(prep_name);
 
     /*
@@ -3397,12 +3842,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * the prepared statements we use in this module are simple enough that
      * the remote server will make the right choices.
      */
-    if (!PQsendPrepare(fmstate->conn,
+    if (!PQsendPrepare(fmstate->s.conn,
                        p_name,
                        fmstate->query,
                        0,
                        NULL))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3410,9 +3855,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
     PQclear(res);
 
     /* This action shows that the prepare has been done. */
@@ -3537,16 +3982,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
          * We don't use a PG_TRY block here, so be careful not to throw error
          * without releasing the PGresult.
          */
-        res = pgfdw_exec_query(fmstate->conn, sql);
+        res = pgfdw_exec_query(fmstate->s.conn, sql);
         if (PQresultStatus(res) != PGRES_COMMAND_OK)
-            pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+            pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
         PQclear(res);
         fmstate->p_name = NULL;
     }
 
     /* Release remote connection */
-    ReleaseConnection(fmstate->conn);
-    fmstate->conn = NULL;
+    ReleaseConnection(fmstate->s.conn);
+    fmstate->s.conn = NULL;
 }
 
 /*
@@ -3706,9 +4151,9 @@ execute_dml_stmt(ForeignScanState *node)
      * the desired result.  This allows us to avoid assuming that the remote
      * server has the same OIDs we do for the parameters' types.
      */
-    if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+    if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
                            NULL, values, NULL, NULL, 0))
-        pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+        pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3716,10 +4161,10 @@ execute_dml_stmt(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+    dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
     if (PQresultStatus(dmstate->result) !=
         (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+        pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
                            dmstate->query);
 
     /* Get the number of rows affected. */
@@ -5203,6 +5648,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
     /* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+    return true;
+}
+
+/*
+ * Configure the wait event.
+ *
+ * Add a wait event only when this node is the connection leader; otherwise
+ * some other node on this connection is the leader.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                                  void *caller_data, bool reinit)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+    /* If the caller didn't reinit, this event is already in the event set */
+    if (!reinit)
+        return true;
+
+    if (fsstate->s.connpriv->leader == node)
+    {
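+        /* This node's query is in flight; wait for the socket to become readable. */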
+        AddWaitEventToSet(wes,
+                          WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+                          NULL, caller_data);
+        return true;
+    }
+
+    return false;
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index a5d4011e8d..f344fb7f66 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
     UserMapping *user;            /* only set in use_remote_estimate mode */
 
     int            fetch_size;        /* fetch size for this remote table */
+    bool        allow_prefetch;    /* true to allow overlapped fetching */
 
     /*
      * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 231b1e01a5..8ecc903c20 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1617,25 +1617,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1677,12 +1677,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1741,8 +1741,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.16.3