Thread: Asynchronous Append on postgres_fdw nodes.

Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
Hello, this is a follow-on of [1] and [2].

Currently the executor visits execution nodes one-by-one.  Considering
sharding, Append on multiple postgres_fdw nodes can work
simultaneously and that can largely shorten the respons of the whole
query.  For example, aggregations that can be pushed-down to remote
would be accelerated by the number of remote servers. Even other than
such an extreme case, collecting tuples from multiple servers also can
be accelerated by tens of percent [2].

I have suspended the work waiting asyncrohous or push-up executor to
come but the mood seems inclining toward doing that before that to
come [3].

The patchset consists of three parts.

- v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
  The async feature uses WaitEvent, and it needs to be released on
  error.  This patch makes it possible to register WaitEvent to
  resowner to handle that case..

- v2-0002-infrastructure-for-asynchronous-execution.patch
  It povides an abstraction layer of asynchronous behavior
  (execAsync). Then adds ExecAppend, another version of ExecAppend,
  that handles "async-capable" subnodes asynchronously. Also it
  contains planner part that makes planner aware of "async-capable"
  and "async-aware" path nodes.

- v2-0003-async-postgres_fdw.patch
  The "async-capable" postgres_fdw.  It accelerates multiple
  postgres_fdw nodes on a single connection case as well as
  postgres_fdw nodes on dedicate connections.

regards.

[1] https://www.postgresql.org/message-id/2020012917585385831113%40highgo.ca
[2] https://www.postgresql.org/message-id/20180515.202945.69332784.horiguchi.kyotaro@lab.ntt.co.jp
[3] https://www.postgresql.org/message-id/20191205181217.GA12895%40momjian.us

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 22099ed9a6107b92c8e2b95ff1d199832810629c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH v2 1/3] Allow wait event set to be registered to resource
 owner

WaitEventSet needs to be released using resource owner for a certain
case. This change adds WaitEventSet reowner and allow the creator of a
WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 67 +++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 +++
 6 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 7717bb2719..16aefb03ee 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -218,7 +218,7 @@ pq_init(void)
                 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
     AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
                       NULL, NULL);
     AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 046ca5c6c7..9c10bd5fcf 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -56,6 +56,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -84,6 +85,8 @@ struct WaitEventSet
     int            nevents;        /* number of registered events */
     int            nevents_space;    /* maximum number of events in this set */
 
+    ResourceOwner    resowner;    /* Resource owner */
+
     /*
      * Array, of nevents_space length, storing the definition of events this
      * set is waiting for.
@@ -393,7 +396,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
     int            ret = 0;
     int            rc;
     WaitEvent    event;
-    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
     if (wakeEvents & WL_TIMEOUT)
         Assert(timeout >= 0);
@@ -560,12 +563,15 @@ ResetLatch(Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
     WaitEventSet *set;
     char       *data;
     Size        sz = 0;
 
+    if (res)
+        ResourceOwnerEnlargeWESs(res);
+
     /*
      * Use MAXALIGN size/alignment to guarantee that later uses of memory are
      * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -680,6 +686,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
     StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+    /* Register this wait event set if requested */
+    set->resowner = res;
+    if (res)
+        ResourceOwnerRememberWES(set->resowner, set);
+
     return set;
 }
 
@@ -725,6 +736,9 @@ FreeWaitEventSet(WaitEventSet *set)
     }
 #endif
 
+    if (set->resowner != NULL)
+        ResourceOwnerForgetWES(set->resowner, set);
+
     pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 37b6a4eecd..fcc92138fe 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -70,7 +70,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     {
         WaitEventSet *new_event_set;
 
-        new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+        new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
         AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
                           MyLatch, NULL);
         AddWaitEventToSet(new_event_set, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 3c39e48825..035e83f4f8 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -128,6 +128,7 @@ typedef struct ResourceOwnerData
     ResourceArray filearr;        /* open temporary files */
     ResourceArray dsmarr;        /* dynamic shmem segments */
     ResourceArray jitarr;        /* JIT contexts */
+    ResourceArray wesarr;        /* wait event sets */
 
     /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
     int            nlocks;            /* number of owned locks */
@@ -175,6 +176,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -444,6 +446,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
     ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
     ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
     ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+    ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
     return owner;
 }
@@ -553,6 +556,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 
             jit_release_context(context);
         }
+
+        /* Ditto for wait event sets */
+        while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+        {
+            WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+            if (isCommit)
+                PrintWESLeakWarning(event);
+            FreeWaitEventSet(event);
+        }
     }
     else if (phase == RESOURCE_RELEASE_LOCKS)
     {
@@ -701,6 +714,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     Assert(owner->filearr.nitems == 0);
     Assert(owner->dsmarr.nitems == 0);
     Assert(owner->jitarr.nitems == 0);
+    Assert(owner->wesarr.nitems == 0);
     Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
     /*
@@ -728,6 +742,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     ResourceArrayFree(&(owner->filearr));
     ResourceArrayFree(&(owner->dsmarr));
     ResourceArrayFree(&(owner->jitarr));
+    ResourceArrayFree(&(owner->wesarr));
 
     pfree(owner);
 }
@@ -1346,3 +1361,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
         elog(ERROR, "JIT context %p is not owned by resource owner %s",
              DatumGetPointer(handle), owner->name);
 }
+
+/*
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+    ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+    ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identier of a wait event set,
+     * use its pointer instead.
+     */
+    if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+        elog(ERROR, "wait event set %p is not owned by resource owner %s",
+             events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identier of a wait event set,
+     * use its pointer instead.
+     */
+    elog(WARNING, "wait event set leak: %p still referenced",
+         events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 46ae56cae3..b1b8375768 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -163,7 +164,8 @@ extern void DisownLatch(Latch *latch);
 extern void SetLatch(Latch *latch);
 extern void ResetLatch(Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+                                        ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int    AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
                               Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..7d19dadd57 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
 extern void ResourceOwnerForgetJIT(ResourceOwner owner,
                                    Datum handle);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+                         WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+                       WaitEventSet *);
+
 #endif                            /* RESOWNER_PRIVATE_H */
-- 
2.18.2

From 8d2fd1f17f8e38e1106017fe6327fbeaec3bcd52 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 15 May 2018 20:21:32 +0900
Subject: [PATCH v2 2/3] infrastructure for asynchronous execution

This patch add an infrastructure for asynchronous execution. As a PoC
this makes only Append capable to handle asynchronously executable
subnodes.
---
 src/backend/commands/explain.c          |  17 ++
 src/backend/executor/Makefile           |   1 +
 src/backend/executor/execAsync.c        | 152 +++++++++++
 src/backend/executor/nodeAppend.c       | 342 ++++++++++++++++++++----
 src/backend/executor/nodeForeignscan.c  |  21 ++
 src/backend/nodes/bitmapset.c           |  72 +++++
 src/backend/nodes/copyfuncs.c           |   3 +
 src/backend/nodes/outfuncs.c            |   3 +
 src/backend/nodes/readfuncs.c           |   3 +
 src/backend/optimizer/plan/createplan.c |  66 ++++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/postmaster/syslogger.c      |   2 +-
 src/backend/utils/adt/ruleutils.c       |   8 +-
 src/include/executor/execAsync.h        |  22 ++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 +
 src/include/nodes/bitmapset.h           |   1 +
 src/include/nodes/execnodes.h           |  23 +-
 src/include/nodes/plannodes.h           |   9 +
 src/include/pgstat.h                    |   3 +-
 21 files changed, 703 insertions(+), 63 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50..daccad8268 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -84,6 +84,7 @@ static void show_sort_keys(SortState *sortstate, List *ancestors,
                            ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                                    ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
                           ExplainState *es);
 static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1343,6 +1344,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
         }
         if (plan->parallel_aware)
             appendStringInfoString(es->str, "Parallel ");
+        if (plan->async_capable)
+            appendStringInfoString(es->str, "Async ");
         appendStringInfoString(es->str, pname);
         es->indent++;
     }
@@ -1916,6 +1919,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
         case T_Hash:
             show_hash_info(castNode(HashState, planstate), es);
             break;
+
+        case T_Append:
+            show_append_info(castNode(AppendState, planstate), es);
+            break;
+
         default:
             break;
     }
@@ -2247,6 +2255,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                          ancestors, es);
 }
 
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+    Append *plan = (Append *) astate->ps.plan;
+
+    if (plan->nasyncplans > 0)
+        ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
 /*
  * Show the grouping keys for an Agg node.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..8a2d6e9961 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
     execAmi.o \
+    execAsync.o \
     execCurrent.o \
     execExpr.o \
     execExprInterp.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..2b7d1877e0
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,152 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *      Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *      src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+/*
+ * ExecAsyncConfigureWait: Add wait event to the WaitEventSet if needed.
+ *
+ * If reinit is true, the caller didn't reuse existing WaitEventSet.
+ */
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                       void *data, bool reinit)
+{
+    switch (nodeTag(node))
+    {
+    case T_ForeignScanState:
+        return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+                                             wes, data, reinit);
+        break;
+    default:
+            elog(ERROR, "unrecognized node type: %d",
+                (int) nodeTag(node));
+    }
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+    int **p_refind;
+    int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+    /* arg is the address of the variable refind in ExecAsyncEventWait */
+    ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+    *mcbarg->p_refind = NULL;
+    *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * ExecAsyncEventWait:
+ *
+ * Wait for async events to fire. Returns the Bitmapset of fired events.
+ */
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+    static int *refind = NULL;
+    static int refindsize = 0;
+    WaitEventSet *wes;
+    WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+    int noccurred = 0;
+    Bitmapset *fired_events = NULL;
+    int i;
+    int n;
+
+    n = bms_num_members(waitnodes);
+    wes = CreateWaitEventSet(TopTransactionContext,
+                             TopTransactionResourceOwner, n);
+    if (refindsize < n)
+    {
+        if (refindsize == 0)
+            refindsize = EVENT_BUFFER_SIZE; /* XXX */
+        while (refindsize < n)
+            refindsize *= 2;
+        if (refind)
+            refind = (int *) repalloc(refind, refindsize * sizeof(int));
+        else
+        {
+            static ExecAsync_mcbarg mcb_arg =
+                { &refind, &refindsize };
+            static MemoryContextCallback mcb =
+                { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+            MemoryContext oldctxt =
+                MemoryContextSwitchTo(TopTransactionContext);
+
+            /*
+             * refind points to a memory block in
+             * TopTransactionContext. Register a callback to reset it.
+             */
+            MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+            refind = (int *) palloc(refindsize * sizeof(int));
+            MemoryContextSwitchTo(oldctxt);
+        }
+    }
+
+    /* Prepare WaitEventSet for waiting on the waitnodes. */
+    n = 0;
+    for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+         i = bms_next_member(waitnodes, i))
+    {
+        refind[i] = i;
+        if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+            n++;
+    }
+
+    /* Return immediately if no node to wait. */
+    if (n == 0)
+    {
+        FreeWaitEventSet(wes);
+        return NULL;
+    }
+
+    noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+                                 EVENT_BUFFER_SIZE,
+                                 WAIT_EVENT_ASYNC_WAIT);
+    FreeWaitEventSet(wes);
+    if (noccurred == 0)
+        return NULL;
+
+    for (i = 0 ; i < noccurred ; i++)
+    {
+        WaitEvent *w = &occurred_event[i];
+
+        if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+        {
+            int n = *(int*)w->user_data;
+
+            fired_events = bms_add_member(fired_events, n);
+        }
+    }
+
+    return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..b5a8adfaf8 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
 #include "executor/execdebug.h"
 #include "executor/execPartition.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -80,6 +81,7 @@ struct ParallelAppendState
 #define INVALID_SUBPLAN_INDEX        -1
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -103,22 +105,22 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     PlanState **appendplanstates;
     Bitmapset  *validsubplans;
     int            nplans;
+    int            nasyncplans;
     int            firstvalid;
     int            i,
                 j;
 
     /* check for unsupported flags */
-    Assert(!(eflags & EXEC_FLAG_MARK));
+    Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
     /*
      * create new AppendState for our append node
      */
     appendstate->ps.plan = (Plan *) node;
     appendstate->ps.state = estate;
-    appendstate->ps.ExecProcNode = ExecAppend;
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+    appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 
     /* If run-time partition pruning is enabled, then set that up now */
     if (node->part_prune_info != NULL)
@@ -152,11 +154,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 
         /*
          * When no run-time pruning is required and there's at least one
-         * subplan, we can fill as_valid_subplans immediately, preventing
+         * subplan, we can fill as_valid_syncsubplans immediately, preventing
          * later calls to ExecFindMatchingSubPlans.
          */
         if (!prunestate->do_exec_prune && nplans > 0)
-            appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
+            appendstate->as_valid_syncsubplans =
+                bms_add_range(NULL, node->nasyncplans, nplans - 1);
     }
     else
     {
@@ -167,8 +170,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
          * subplans as valid; they must also all be initialized.
          */
         Assert(nplans > 0);
-        appendstate->as_valid_subplans = validsubplans =
-            bms_add_range(NULL, 0, nplans - 1);
+        validsubplans =    bms_add_range(NULL, 0, nplans - 1);
+        appendstate->as_valid_syncsubplans =
+            bms_add_range(NULL, node->nasyncplans, nplans - 1);
         appendstate->as_prune_state = NULL;
     }
 
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
      */
     j = 0;
     firstvalid = nplans;
+    nasyncplans = 0;
+
     i = -1;
     while ((i = bms_next_member(validsubplans, i)) >= 0)
     {
         Plan       *initNode = (Plan *) list_nth(node->appendplans, i);
+        int            sub_eflags = eflags;
+
+        /* Let async-capable subplans run asynchronously */
+        if (i < node->nasyncplans)
+        {
+            sub_eflags |= EXEC_FLAG_ASYNC;
+            nasyncplans++;
+        }
 
         /*
          * Record the lowest appendplans index which is a valid partial plan.
@@ -203,13 +217,46 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
         if (i >= node->first_partial_plan && j < firstvalid)
             firstvalid = j;
 
-        appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+        appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
     }
 
     appendstate->as_first_partial_plan = firstvalid;
     appendstate->appendplans = appendplanstates;
     appendstate->as_nplans = nplans;
 
+    /* fill in async stuff */
+    appendstate->as_nasyncplans = nasyncplans;
+    appendstate->as_syncdone = (nasyncplans == nplans);
+    appendstate->as_exec_prune = false;
+
+    /* choose appropriate version of Exec function */
+    if (appendstate->as_nasyncplans == 0)
+        appendstate->ps.ExecProcNode = ExecAppend;
+    else
+        appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+    if (appendstate->as_nasyncplans)
+    {
+        appendstate->as_asyncresult = (TupleTableSlot **)
+            palloc0(appendstate->as_nasyncplans * sizeof(TupleTableSlot *));
+
+        /* initially, all async requests need a request */
+        appendstate->as_needrequest =
+            bms_add_range(NULL, 0, appendstate->as_nasyncplans - 1);
+
+        /*
+         * ExecAppendAsync needs as_valid_syncsubplans to handle async
+         * subnodes.
+         */
+        if (appendstate->as_prune_state != NULL &&
+            appendstate->as_prune_state->do_exec_prune)
+        {
+            Assert(appendstate->as_valid_syncsubplans == NULL);
+
+            appendstate->as_exec_prune = true;
+        }
+    }
+
     /*
      * Miscellaneous initialization
      */
@@ -233,7 +280,7 @@ ExecAppend(PlanState *pstate)
 {
     AppendState *node = castNode(AppendState, pstate);
 
-    if (node->as_whichplan < 0)
+    if (node->as_whichsyncplan < 0)
     {
         /* Nothing to do if there are no subplans */
         if (node->as_nplans == 0)
@@ -243,11 +290,13 @@ ExecAppend(PlanState *pstate)
          * If no subplan has been chosen, we must choose one before
          * proceeding.
          */
-        if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+        if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
             !node->choose_next_subplan(node))
             return ExecClearTuple(node->ps.ps_ResultTupleSlot);
     }
 
+    Assert(node->as_nasyncplans == 0);
+
     for (;;)
     {
         PlanState  *subnode;
@@ -258,8 +307,9 @@ ExecAppend(PlanState *pstate)
         /*
          * figure out which subplan we are currently processing
          */
-        Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-        subnode = node->appendplans[node->as_whichplan];
+        Assert(node->as_whichsyncplan >= 0 &&
+               node->as_whichsyncplan < node->as_nplans);
+        subnode = node->appendplans[node->as_whichsyncplan];
 
         /*
          * get a tuple from the subplan
@@ -282,6 +332,172 @@ ExecAppend(PlanState *pstate)
     }
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+    AppendState *node = castNode(AppendState, pstate);
+    Bitmapset *needrequest;
+    int    i;
+
+    Assert(node->as_nasyncplans > 0);
+
+restart:
+    if (node->as_nasyncresult > 0)
+    {
+        --node->as_nasyncresult;
+        return node->as_asyncresult[node->as_nasyncresult];
+    }
+
+    if (node->as_exec_prune)
+    {
+        Bitmapset *valid_subplans =
+            ExecFindMatchingSubPlans(node->as_prune_state);
+
+        /* Distribute valid subplans into sync and async */
+        node->as_needrequest =
+            bms_intersect(node->as_needrequest, valid_subplans);
+        node->as_valid_syncsubplans =
+            bms_difference(valid_subplans, node->as_needrequest);
+
+        node->as_exec_prune = false;
+    }
+
+    needrequest = node->as_needrequest;
+    node->as_needrequest = NULL;
+    while ((i = bms_first_member(needrequest)) >= 0)
+    {
+        TupleTableSlot *slot;
+        PlanState *subnode = node->appendplans[i];
+
+        slot = ExecProcNode(subnode);
+        if (subnode->asyncstate == AS_AVAILABLE)
+        {
+            if (!TupIsNull(slot))
+            {
+                node->as_asyncresult[node->as_nasyncresult++] = slot;
+                node->as_needrequest = bms_add_member(node->as_needrequest, i);
+            }
+        }
+        else
+            node->as_pending_async = bms_add_member(node->as_pending_async, i);
+    }
+    bms_free(needrequest);
+
+    for (;;)
+    {
+        TupleTableSlot *result;
+
+        /* return now if a result is available */
+        if (node->as_nasyncresult > 0)
+        {
+            --node->as_nasyncresult;
+            return node->as_asyncresult[node->as_nasyncresult];
+        }
+
+        while (!bms_is_empty(node->as_pending_async))
+        {
+            /* Don't wait for async nodes if any sync node exists. */
+            long timeout = node->as_syncdone ? -1 : 0;
+            Bitmapset *fired;
+            int i;
+
+            fired = ExecAsyncEventWait(node->appendplans,
+                                       node->as_pending_async,
+                                       timeout);
+
+            if (bms_is_empty(fired) && node->as_syncdone)
+            {
+                /*
+                 * We come here when all the subnodes had fired before
+                 * waiting. Rery fetching from the nodes.
+                 */
+                node->as_needrequest = node->as_pending_async;
+                node->as_pending_async = NULL;
+                goto restart;
+            }
+
+            while ((i = bms_first_member(fired)) >= 0)
+            {
+                TupleTableSlot *slot;
+                PlanState *subnode = node->appendplans[i];
+                slot = ExecProcNode(subnode);
+
+                Assert(subnode->asyncstate == AS_AVAILABLE);
+
+                if (!TupIsNull(slot))
+                {
+                    node->as_asyncresult[node->as_nasyncresult++] = slot;
+                    node->as_needrequest =
+                        bms_add_member(node->as_needrequest, i);
+                }
+
+                node->as_pending_async =
+                    bms_del_member(node->as_pending_async, i);
+            }
+            bms_free(fired);
+
+            /* return now if a result is available */
+            if (node->as_nasyncresult > 0)
+            {
+                --node->as_nasyncresult;
+                return node->as_asyncresult[node->as_nasyncresult];
+            }
+
+            if (!node->as_syncdone)
+                break;
+        }
+
+        /*
+         * If there is no asynchronous activity still pending and the
+         * synchronous activity is also complete, we're totally done scanning
+         * this node.  Otherwise, we're done with the asynchronous stuff but
+         * must continue scanning the synchronous children.
+         */
+
+        if (!node->as_syncdone &&
+            node->as_whichsyncplan == INVALID_SUBPLAN_INDEX)
+            node->as_syncdone = !node->choose_next_subplan(node);
+
+        if (node->as_syncdone)
+        {
+            Assert(bms_is_empty(node->as_pending_async));
+            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+        }
+
+        /*
+         * get a tuple from the subplan
+         */
+        result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+        if (!TupIsNull(result))
+        {
+            /*
+             * If the subplan gave us something then return it as-is. We do
+             * NOT make use of the result slot that was set up in
+             * ExecInitAppend; there's no need for it.
+             */
+            return result;
+        }
+
+        /*
+         * Go on to the "next" subplan. If no more subplans, return the empty
+         * slot set up for us by ExecInitAppend, unless there are async plans
+         * we have yet to finish.
+         */
+        if (!node->choose_next_subplan(node))
+        {
+            node->as_syncdone = true;
+            if (bms_is_empty(node->as_pending_async))
+            {
+                Assert(bms_is_empty(node->as_needrequest));
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+            }
+        }
+
+        /* Else loop back and try to get a tuple from the new subplan */
+    }
+}
+
 /* ----------------------------------------------------------------
  *        ExecEndAppend
  *
@@ -324,10 +540,18 @@ ExecReScanAppend(AppendState *node)
         bms_overlap(node->ps.chgParam,
                     node->as_prune_state->execparamids))
     {
-        bms_free(node->as_valid_subplans);
-        node->as_valid_subplans = NULL;
+        bms_free(node->as_valid_syncsubplans);
+        node->as_valid_syncsubplans = NULL;
     }
 
+    /* Reset async state. */
+    for (i = 0; i < node->as_nasyncplans; ++i)
+        ExecShutdownNode(node->appendplans[i]);
+
+    node->as_nasyncresult = 0;
+    node->as_needrequest = bms_add_range(NULL, 0, node->as_nasyncplans - 1);
+    node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
     for (i = 0; i < node->as_nplans; i++)
     {
         PlanState  *subnode = node->appendplans[i];
@@ -348,7 +572,7 @@ ExecReScanAppend(AppendState *node)
     }
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    node->as_whichplan = INVALID_SUBPLAN_INDEX;
+    node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 }
 
 /* ----------------------------------------------------------------
@@ -436,7 +660,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-    int            whichplan = node->as_whichplan;
+    int            whichplan = node->as_whichsyncplan;
     int            nextplan;
 
     /* We should never be called when there are no subplans */
@@ -451,10 +675,18 @@ choose_next_subplan_locally(AppendState *node)
      */
     if (whichplan == INVALID_SUBPLAN_INDEX)
     {
-        if (node->as_valid_subplans == NULL)
-            node->as_valid_subplans =
+        /* Shouldn't have an active async node */
+        Assert(bms_is_empty(node->as_needrequest));
+
+        if (node->as_valid_syncsubplans == NULL)
+            node->as_valid_syncsubplans =
                 ExecFindMatchingSubPlans(node->as_prune_state);
 
+        /* Exclude async plans */
+        if (node->as_nasyncplans > 0)
+            bms_del_range(node->as_valid_syncsubplans,
+                          0, node->as_nasyncplans - 1);
+
         whichplan = -1;
     }
 
@@ -462,14 +694,14 @@ choose_next_subplan_locally(AppendState *node)
     Assert(whichplan >= -1 && whichplan <= node->as_nplans);
 
     if (ScanDirectionIsForward(node->ps.state->es_direction))
-        nextplan = bms_next_member(node->as_valid_subplans, whichplan);
+        nextplan = bms_next_member(node->as_valid_syncsubplans, whichplan);
     else
-        nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
+        nextplan = bms_prev_member(node->as_valid_syncsubplans, whichplan);
 
     if (nextplan < 0)
         return false;
 
-    node->as_whichplan = nextplan;
+    node->as_whichsyncplan = nextplan;
 
     return true;
 }
@@ -490,29 +722,29 @@ choose_next_subplan_for_leader(AppendState *node)
     /* Backward scan is not supported by parallel-aware plans */
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
-    /* We should never be called when there are no subplans */
-    Assert(node->as_nplans > 0);
+    /* We should never be called when there are no sync subplans */
+    Assert(node->as_nplans > node->as_nasyncplans);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
     {
         /* Mark just-completed subplan as finished. */
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
     }
     else
     {
         /* Start with last subplan. */
-        node->as_whichplan = node->as_nplans - 1;
+        node->as_whichsyncplan = node->as_nplans - 1;
 
         /*
          * If we've yet to determine the valid subplans then do so now.  If
          * run-time pruning is disabled then the valid subplans will always be
          * set to all subplans.
          */
-        if (node->as_valid_subplans == NULL)
+        if (node->as_valid_syncsubplans == NULL)
         {
-            node->as_valid_subplans =
+            node->as_valid_syncsubplans =
                 ExecFindMatchingSubPlans(node->as_prune_state);
 
             /*
@@ -524,26 +756,26 @@ choose_next_subplan_for_leader(AppendState *node)
     }
 
     /* Loop until we find a subplan to execute. */
-    while (pstate->pa_finished[node->as_whichplan])
+    while (pstate->pa_finished[node->as_whichsyncplan])
     {
-        if (node->as_whichplan == 0)
+        if (node->as_whichsyncplan == 0)
         {
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-            node->as_whichplan = INVALID_SUBPLAN_INDEX;
+            node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
             LWLockRelease(&pstate->pa_lock);
             return false;
         }
 
         /*
-         * We needn't pay attention to as_valid_subplans here as all invalid
+         * We needn't pay attention to as_valid_syncsubplans here as all invalid
          * plans have been marked as finished.
          */
-        node->as_whichplan--;
+        node->as_whichsyncplan--;
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -571,23 +803,23 @@ choose_next_subplan_for_worker(AppendState *node)
     /* Backward scan is not supported by parallel-aware plans */
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
-    /* We should never be called when there are no subplans */
-    Assert(node->as_nplans > 0);
+    /* We should never be called when there are no sync subplans */
+    Assert(node->as_nplans > node->as_nasyncplans);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
     /* Mark just-completed subplan as finished. */
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     /*
      * If we've yet to determine the valid subplans then do so now.  If
      * run-time pruning is disabled then the valid subplans will always be set
      * to all subplans.
      */
-    else if (node->as_valid_subplans == NULL)
+    else if (node->as_valid_syncsubplans == NULL)
     {
-        node->as_valid_subplans =
+        node->as_valid_syncsubplans =
             ExecFindMatchingSubPlans(node->as_prune_state);
         mark_invalid_subplans_as_finished(node);
     }
@@ -600,30 +832,30 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Save the plan from which we are starting the search. */
-    node->as_whichplan = pstate->pa_next_plan;
+    node->as_whichsyncplan = pstate->pa_next_plan;
 
     /* Loop until we find a valid subplan to execute. */
     while (pstate->pa_finished[pstate->pa_next_plan])
     {
         int            nextplan;
 
-        nextplan = bms_next_member(node->as_valid_subplans,
+        nextplan = bms_next_member(node->as_valid_syncsubplans,
                                    pstate->pa_next_plan);
         if (nextplan >= 0)
         {
             /* Advance to the next valid plan. */
             pstate->pa_next_plan = nextplan;
         }
-        else if (node->as_whichplan > node->as_first_partial_plan)
+        else if (node->as_whichsyncplan > node->as_first_partial_plan)
         {
             /*
              * Try looping back to the first valid partial plan, if there is
              * one.  If there isn't, arrange to bail out below.
              */
-            nextplan = bms_next_member(node->as_valid_subplans,
+            nextplan = bms_next_member(node->as_valid_syncsubplans,
                                        node->as_first_partial_plan - 1);
             pstate->pa_next_plan =
-                nextplan < 0 ? node->as_whichplan : nextplan;
+                nextplan < 0 ? node->as_whichsyncplan : nextplan;
         }
         else
         {
@@ -631,10 +863,10 @@ choose_next_subplan_for_worker(AppendState *node)
              * At last plan, and either there are no partial plans or we've
              * tried them all.  Arrange to bail out.
              */
-            pstate->pa_next_plan = node->as_whichplan;
+            pstate->pa_next_plan = node->as_whichsyncplan;
         }
 
-        if (pstate->pa_next_plan == node->as_whichplan)
+        if (pstate->pa_next_plan == node->as_whichsyncplan)
         {
             /* We've tried everything! */
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -644,8 +876,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Pick the plan we found, and advance pa_next_plan one more time. */
-    node->as_whichplan = pstate->pa_next_plan;
-    pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
+    node->as_whichsyncplan = pstate->pa_next_plan;
+    pstate->pa_next_plan = bms_next_member(node->as_valid_syncsubplans,
                                            pstate->pa_next_plan);
 
     /*
@@ -654,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node)
      */
     if (pstate->pa_next_plan < 0)
     {
-        int            nextplan = bms_next_member(node->as_valid_subplans,
+        int            nextplan = bms_next_member(node->as_valid_syncsubplans,
                                                node->as_first_partial_plan - 1);
 
         if (nextplan >= 0)
@@ -671,8 +903,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -699,13 +931,13 @@ mark_invalid_subplans_as_finished(AppendState *node)
     Assert(node->as_prune_state);
 
     /* Nothing to do if all plans are valid */
-    if (bms_num_members(node->as_valid_subplans) == node->as_nplans)
+    if (bms_num_members(node->as_valid_syncsubplans) == node->as_nplans)
         return;
 
     /* Mark all non-valid plans as finished */
     for (i = 0; i < node->as_nplans; i++)
     {
-        if (!bms_is_member(i, node->as_valid_subplans))
+        if (!bms_is_member(i, node->as_valid_syncsubplans))
             node->as_pstate->pa_finished[i] = true;
     }
 }
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 513471ab9b..3bf4aaa63d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -141,6 +141,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
     scanstate->ss.ps.plan = (Plan *) node;
     scanstate->ss.ps.state = estate;
     scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+    scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+    if ((eflags & EXEC_FLAG_ASYNC) != 0)
+        scanstate->fs_async = true;
 
     /*
      * Miscellaneous initialization
@@ -384,3 +388,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
     if (fdwroutine->ShutdownForeignScan)
         fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *        ExecAsyncForeignScanConfigureWait
+ *
+ *        In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                              void *caller_data, bool reinit)
+{
+    FdwRoutine *fdwroutine = node->fdwroutine;
+
+    Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+    return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+                                                 caller_data, reinit);
+}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 2719ea45a3..05b625783b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -895,6 +895,78 @@ bms_add_range(Bitmapset *a, int lower, int upper)
     return a;
 }
 
+/*
+ * bms_del_range
+ *        Delete members in the range of 'lower' to 'upper' from the set.
+ *
+ * Note this could also be done by calling bms_del_member in a loop, however,
+ * using this function will be faster when the range is large as we work at
+ * the bitmapword level rather than at bit level.
+ */
+Bitmapset *
+bms_del_range(Bitmapset *a, int lower, int upper)
+{
+    int            lwordnum,
+                lbitnum,
+                uwordnum,
+                ushiftbits,
+                wordnum;
+
+    if (lower < 0 || upper < 0)
+        elog(ERROR, "negative bitmapset member not allowed");
+    if (lower > upper)
+        elog(ERROR, "lower range must not be above upper range");
+    uwordnum = WORDNUM(upper);
+
+    if (a == NULL)
+    {
+        a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1));
+        a->nwords = uwordnum + 1;
+    }
+
+    /* ensure we have enough words to store the upper bit */
+    else if (uwordnum >= a->nwords)
+    {
+        int            oldnwords = a->nwords;
+        int            i;
+
+        a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1));
+        a->nwords = uwordnum + 1;
+        /* zero out the enlarged portion */
+        for (i = oldnwords; i < a->nwords; i++)
+            a->words[i] = 0;
+    }
+
+    wordnum = lwordnum = WORDNUM(lower);
+
+    lbitnum = BITNUM(lower);
+    ushiftbits = BITNUM(upper) + 1;
+
+    /*
+     * Special case when lwordnum is the same as uwordnum we must perform the
+     * upper and lower masking on the word.
+     */
+    if (lwordnum == uwordnum)
+    {
+        a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1)
+                                | (~(bitmapword) 0) << ushiftbits);
+    }
+    else
+    {
+        /* turn off lbitnum and all bits left of it */
+        a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1);
+
+        /* turn off all bits for any intermediate words */
+        while (wordnum < uwordnum)
+            a->words[wordnum++] = (bitmapword) 0;
+
+        /* turn off upper's bit and all bits right of it. */
+        a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits;
+    }
+
+    return a;
+}
+
 /*
  * bms_int_members - like bms_intersect, but left input is recycled
  */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 54ad62bb7f..59205e5da6 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -121,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
     COPY_SCALAR_FIELD(plan_width);
     COPY_SCALAR_FIELD(parallel_aware);
     COPY_SCALAR_FIELD(parallel_safe);
+    COPY_SCALAR_FIELD(async_capable);
     COPY_SCALAR_FIELD(plan_node_id);
     COPY_NODE_FIELD(targetlist);
     COPY_NODE_FIELD(qual);
@@ -246,6 +247,8 @@ _copyAppend(const Append *from)
     COPY_NODE_FIELD(appendplans);
     COPY_SCALAR_FIELD(first_partial_plan);
     COPY_NODE_FIELD(part_prune_info);
+    COPY_SCALAR_FIELD(nasyncplans);
+    COPY_SCALAR_FIELD(referent);
 
     return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..130b4c7b85 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -334,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
     WRITE_INT_FIELD(plan_width);
     WRITE_BOOL_FIELD(parallel_aware);
     WRITE_BOOL_FIELD(parallel_safe);
+    WRITE_BOOL_FIELD(async_capable);
     WRITE_INT_FIELD(plan_node_id);
     WRITE_NODE_FIELD(targetlist);
     WRITE_NODE_FIELD(qual);
@@ -436,6 +437,8 @@ _outAppend(StringInfo str, const Append *node)
     WRITE_NODE_FIELD(appendplans);
     WRITE_INT_FIELD(first_partial_plan);
     WRITE_NODE_FIELD(part_prune_info);
+    WRITE_INT_FIELD(nasyncplans);
+    WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 551ce6c41c..1708337177 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1571,6 +1571,7 @@ ReadCommonPlan(Plan *local_node)
     READ_INT_FIELD(plan_width);
     READ_BOOL_FIELD(parallel_aware);
     READ_BOOL_FIELD(parallel_safe);
+    READ_BOOL_FIELD(async_capable);
     READ_INT_FIELD(plan_node_id);
     READ_NODE_FIELD(targetlist);
     READ_NODE_FIELD(qual);
@@ -1671,6 +1672,8 @@ _readAppend(void)
     READ_NODE_FIELD(appendplans);
     READ_INT_FIELD(first_partial_plan);
     READ_NODE_FIELD(part_prune_info);
+    READ_INT_FIELD(nasyncplans);
+    READ_INT_FIELD(referent);
 
     READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..8bb5294155 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -292,6 +292,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
                                      List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
                                              GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -1069,6 +1070,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     bool        tlist_was_changed = false;
     List       *pathkeys = best_path->path.pathkeys;
     List       *subplans = NIL;
+    List       *asyncplans = NIL;
+    List       *syncplans = NIL;
+    List       *asyncpaths = NIL;
+    List       *syncpaths = NIL;
+    List       *newsubpaths = NIL;
     ListCell   *subpaths;
     RelOptInfo *rel = best_path->path.parent;
     PartitionPruneInfo *partpruneinfo = NULL;
@@ -1077,6 +1083,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     Oid           *nodeSortOperators = NULL;
     Oid           *nodeCollations = NULL;
     bool       *nodeNullsFirst = NULL;
+    int            nasyncplans = 0;
+    bool        first = true;
+    bool        referent_is_sync = true;
 
     /*
      * The subpaths list could be empty, if every child was proven empty by
@@ -1206,9 +1215,36 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
             }
         }
 
-        subplans = lappend(subplans, subplan);
+        /*
+         * Classify as async-capable or not. If we have decided to run the
+         * chidlren in parallel, we cannot any one of them run asynchronously.
+         */
+        if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+        {
+            subplan->async_capable = true;
+            asyncplans = lappend(asyncplans, subplan);
+            asyncpaths = lappend(asyncpaths, subpath);
+            ++nasyncplans;
+            if (first)
+                referent_is_sync = false;
+        }
+        else
+        {
+            syncplans = lappend(syncplans, subplan);
+            syncpaths = lappend(syncpaths, subpath);
+        }
+
+        first = false;
     }
 
+    /*
+     * subplan contains asyncplans in the first half, if any, and sync plans in
+     * another half, if any.  We need that the same for subpaths to make
+     * partition pruning information in sync with subplans.
+     */
+    subplans = list_concat(asyncplans, syncplans);
+    newsubpaths = list_concat(asyncpaths, syncpaths);
+
     /*
      * If any quals exist, they may be useful to perform further partition
      * pruning during execution.  Gather information needed by the executor to
@@ -1236,7 +1272,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
         if (prunequal != NIL)
             partpruneinfo =
                 make_partition_pruneinfo(root, rel,
-                                         best_path->subpaths,
+                                         newsubpaths,
                                          best_path->partitioned_rels,
                                          prunequal);
     }
@@ -1244,6 +1280,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     plan->appendplans = subplans;
     plan->first_partial_plan = best_path->first_partial_path;
     plan->part_prune_info = partpruneinfo;
+    plan->nasyncplans = nasyncplans;
+    plan->referent = referent_is_sync ? nasyncplans : 0;
 
     copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6841,3 +6879,27 @@ is_projection_capable_plan(Plan *plan)
     }
     return true;
 }
+
+/*
+ * is_projection_capable_path
+ *        Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+    switch (nodeTag(path))
+    {
+        case T_ForeignPath:
+            {
+                FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+                Assert(fdwroutine != NULL);
+                if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+                    fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+                    return true;
+            }
+        default:
+            break;
+    }
+    return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 462b4d7e06..4a812bed24 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3851,6 +3851,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_SYNC_REP:
             event_name = "SyncRep";
             break;
+        case WAIT_EVENT_ASYNC_WAIT:
+            event_name = "AsyncExecWait";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index cf7b535e4e..32c1e51128 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -306,7 +306,7 @@ SysLoggerMain(int argc, char *argv[])
      * syslog pipe, which implies that all other backends have exited
      * (including the postmaster).
      */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 2);
+    wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 2);
     AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
 #ifndef WIN32
     AddWaitEventToSet(wes, WL_SOCKET_READABLE, syslogPipe[0], NULL, NULL);
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 158784474d..70489c7c4c 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4573,10 +4573,14 @@ set_deparse_plan(deparse_namespace *dpns, Plan *plan)
      * tlists according to one of the children, and the first one is the most
      * natural choice.  Likewise special-case ModifyTable to pretend that the
      * first child plan is the OUTER referent; this is to support RETURNING
-     * lists containing references to non-target relations.
+     * lists containing references to non-target relations. For Append, use the
+     * explicitly specified referent.
      */
     if (IsA(plan, Append))
-        dpns->outer_plan = linitial(((Append *) plan)->appendplans);
+    {
+        Append *app = (Append *) plan;
+        dpns->outer_plan = list_nth(app->appendplans, app->referent);
+    }
     else if (IsA(plan, MergeAppend))
         dpns->outer_plan = linitial(((MergeAppend *) plan)->mergeplans);
     else if (IsA(plan, ModifyTable))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..3b6bf4a516
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,22 @@
+/*--------------------------------------------------------------------
+ * execAsync.c
+ *        Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *        src/backend/executor/execAsync.c
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                                   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+                                     long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 81fdfa4add..e5d5e9726d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -59,6 +59,7 @@
 #define EXEC_FLAG_MARK            0x0008    /* need mark/restore */
 #define EXEC_FLAG_SKIP_TRIGGERS 0x0010    /* skip AfterTrigger calls */
 #define EXEC_FLAG_WITH_NO_DATA    0x0020    /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC            0x0040    /* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..71a233b41f 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
                                             ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+                                          WaitEventSet *wes,
+                                          void *caller_data, bool reinit);
 
 #endif                            /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..853ba2b5ad 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -169,6 +169,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
                                                             List *fdw_private,
                                                             RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+                                                    WaitEventSet *wes,
+                                                    void *caller_data,
+                                                    bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -190,6 +195,7 @@ typedef struct FdwRoutine
     GetForeignPlan_function GetForeignPlan;
     BeginForeignScan_function BeginForeignScan;
     IterateForeignScan_function IterateForeignScan;
+    IterateForeignScan_function IterateForeignScanAsync;
     ReScanForeignScan_function ReScanForeignScan;
     EndForeignScan_function EndForeignScan;
 
@@ -242,6 +248,11 @@ typedef struct FdwRoutine
     InitializeDSMForeignScan_function InitializeDSMForeignScan;
     ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
     InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+    /* Support functions for asynchronous execution */
+    IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+    ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
     ShutdownForeignScan_function ShutdownForeignScan;
 
     /* Support functions for path reparameterization. */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index d113c271ee..177e6218cb 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -107,6 +107,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
+extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
 
 /* support for iterating through the integer elements of a set: */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f..7778f5ddc2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -936,6 +936,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+    AS_AVAILABLE,
+    AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
     NodeTag        type;
@@ -1024,6 +1030,11 @@ typedef struct PlanState
     bool        outeropsset;
     bool        inneropsset;
     bool        resultopsset;
+
+    /* Async subnode execution sutff */
+    AsyncState    asyncstate;
+
+    int32        padding;            /* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1219,14 +1230,21 @@ struct AppendState
     PlanState    ps;                /* its first field is NodeTag */
     PlanState **appendplans;    /* array of PlanStates for my inputs */
     int            as_nplans;
-    int            as_whichplan;
+    int            as_whichsyncplan; /* which sync plan is being executed  */
     int            as_first_partial_plan;    /* Index of 'appendplans' containing
                                          * the first partial plan */
+    int            as_nasyncplans;    /* # of async-capable children */
     ParallelAppendState *as_pstate; /* parallel coordination info */
     Size        pstate_len;        /* size of parallel coordination info */
     struct PartitionPruneState *as_prune_state;
-    Bitmapset  *as_valid_subplans;
+    Bitmapset  *as_valid_syncsubplans;
     bool        (*choose_next_subplan) (AppendState *);
+    bool        as_syncdone;    /* all synchronous plans done? */
+    Bitmapset  *as_needrequest;    /* async plans needing a new request */
+    Bitmapset  *as_pending_async;    /* pending async plans */
+    TupleTableSlot **as_asyncresult;    /* unreturned results of async plans */
+    int            as_nasyncresult;    /* # of valid entries in as_asyncresult */
+    bool        as_exec_prune;    /* runtime pruning needed for async exec? */
 };
 
 /* ----------------
@@ -1794,6 +1812,7 @@ typedef struct ForeignScanState
     Size        pscan_len;        /* size of parallel coordination information */
     /* use struct pointer to avoid including fdwapi.h here */
     struct FdwRoutine *fdwroutine;
+    bool        fs_async;
     void       *fdw_state;        /* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 99835ae2e4..fa4ddbb400 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -135,6 +135,11 @@ typedef struct Plan
     bool        parallel_aware; /* engage parallel-aware logic? */
     bool        parallel_safe;    /* OK to use as part of parallel plan? */
 
+    /*
+     * information needed for asynchronous execution
+     */
+    bool        async_capable;  /* engage asyncronous execution logic? */
+
     /*
      * Common structural data for all Plan types.
      */
@@ -262,6 +267,10 @@ typedef struct Append
 
     /* Info for run-time subplan pruning; NULL if we're not doing that */
     struct PartitionPruneInfo *part_prune_info;
+
+    /* Async child node execution stuff */
+    int            nasyncplans;    /* # async subplans, always at start of list */
+    int            referent;         /* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3a65a51696..1bc713254c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -853,7 +853,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_ORIGIN_DROP,
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
+    WAIT_EVENT_SYNC_REP,
+    WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.18.2

From 6312e1a42c5a89642bb0ec1b7373e5ce4f8e0326 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH v2 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c             |  28 +
 .../postgres_fdw/expected/postgres_fdw.out    | 222 ++++---
 contrib/postgres_fdw/postgres_fdw.c           | 607 ++++++++++++++++--
 contrib/postgres_fdw/postgres_fdw.h           |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql     |  20 +-
 5 files changed, 703 insertions(+), 176 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index e45647f3ea..2184f7745a 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
     bool        invalidated;    /* true if reconnect is pending */
     uint32        server_hashvalue;    /* hash value of foreign server OID */
     uint32        mapping_hashvalue;    /* hash value of user mapping OID */
+    void        *storage;        /* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
         elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
              entry->conn, server->servername, user->umid, user->userid);
+        entry->storage = NULL;
     }
 
     /*
@@ -215,6 +217,32 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
     return entry->conn;
 }
 
+/*
+ * Rerturns the connection specific storage for this user. Allocate with
+ * initsize if not exists.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+    bool        found;
+    ConnCacheEntry *entry;
+    ConnCacheKey key;
+
+    /* Find storage using the same key with GetConnection */
+    key = user->umid;
+    entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+    Assert(found);
+
+    /* Create one if not any. */
+    if (entry->storage == NULL)
+    {
+        entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+        memset(entry->storage, 0, initsize);
+    }
+
+    return entry->storage;
+}
+
 /*
  * Connect to remote server using specified server and user mapping properties.
  */
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 62c2697920..e11e0d40a7 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6973,7 +6973,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -7001,7 +7001,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -7029,7 +7029,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -7057,7 +7057,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -7127,35 +7127,41 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar bar_1
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+                     Sort Key: bar_1.f1
+                     ->  Seq Scan on public.bar bar_1
+                           Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
                ->  Foreign Scan on public.bar2 bar_2
                      Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -7165,35 +7171,41 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar bar_1
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+                     Sort Key: bar_1.f1
+                     ->  Seq Scan on public.bar bar_1
+                           Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
                ->  Foreign Scan on public.bar2 bar_2
                      Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -7223,11 +7235,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
    ->  Hash Join
          Output: bar_1.f1, (bar_1.f2 + 100), bar_1.f3, bar_1.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -7241,12 +7254,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(41 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
 select tableoid::regclass, * from bar order by 1,2;
@@ -7276,16 +7290,17 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
-               ->  Foreign Scan on public.foo2 foo_1
+               Async subplans: 2 
+               ->  Async Foreign Scan on public.foo2 foo_1
                      Output: ROW(foo_1.f1), foo_1.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_2
-                     Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
-               ->  Foreign Scan on public.foo2 foo_3
+               ->  Async Foreign Scan on public.foo2 foo_3
                      Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_2
+                     Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -7303,17 +7318,18 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
-                     ->  Foreign Scan on public.foo2 foo_1
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on public.foo2 foo_1
                            Output: ROW(foo_1.f1), foo_1.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_2
-                           Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
-                     ->  Foreign Scan on public.foo2 foo_3
+                     ->  Async Foreign Scan on public.foo2 foo_3
                            Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_2
+                           Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
+(47 rows)
 
 update bar set f2 = f2 + 100
 from
@@ -7463,27 +7479,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2 bar_1
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2 bar_1
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2 bar_1
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2 bar_1
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
@@ -8558,11 +8580,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
  Sort
    Sort Key: t1.a, t3.c
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE
t1.a% 25 =0 ORDER BY 1,2,3;
 
   a  |  b  |  c   
@@ -8597,20 +8620,22 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
 -- with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2
t2WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
 
-                       QUERY PLAN                       
---------------------------------------------------------
+                          QUERY PLAN                          
+--------------------------------------------------------------
  Sort
    Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
    ->  Hash Full Join
          Hash Cond: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
-               ->  Foreign Scan on ftprt1_p2 t1_2
+               Async subplans: 2 
+               ->  Async Foreign Scan on ftprt1_p1 t1_1
+               ->  Async Foreign Scan on ftprt1_p2 t1_2
          ->  Hash
                ->  Append
-                     ->  Foreign Scan on ftprt2_p1 t2_1
-                     ->  Foreign Scan on ftprt2_p2 t2_2
-(11 rows)
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on ftprt2_p1 t2_1
+                     ->  Async Foreign Scan on ftprt2_p2 t2_2
+(13 rows)
 
 SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2
t2WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
 
        wr       |       wr       
@@ -8639,11 +8664,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
  Sort
    Sort Key: t1.a, t1.b
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE
t1.a%25= 0 ORDER BY 1,2;
 
   a  |  b  
@@ -8696,21 +8722,23 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
 -- test FOR UPDATE; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF
t1;
-                          QUERY PLAN                          
---------------------------------------------------------------
+                             QUERY PLAN                             
+--------------------------------------------------------------------
  LockRows
    ->  Sort
          Sort Key: t1.a
          ->  Hash Join
                Hash Cond: (t2.b = t1.a)
                ->  Append
-                     ->  Foreign Scan on ftprt2_p1 t2_1
-                     ->  Foreign Scan on ftprt2_p2 t2_2
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on ftprt2_p1 t2_1
+                     ->  Async Foreign Scan on ftprt2_p2 t2_2
                ->  Hash
                      ->  Append
-                           ->  Foreign Scan on ftprt1_p1 t1_1
-                           ->  Foreign Scan on ftprt1_p2 t1_2
-(12 rows)
+                           Async subplans: 2 
+                           ->  Async Foreign Scan on ftprt1_p1 t1_1
+                           ->  Async Foreign Scan on ftprt1_p2 t1_2
+(14 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF
t1;
   a  |  b  
@@ -8745,18 +8773,19 @@ ANALYZE fpagg_tab_p3;
 SET enable_partitionwise_aggregate TO false;
 EXPLAIN (COSTS OFF)
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
-                        QUERY PLAN                         
------------------------------------------------------------
+                           QUERY PLAN                            
+-----------------------------------------------------------------
  Sort
    Sort Key: pagg_tab.a
    ->  HashAggregate
          Group Key: pagg_tab.a
          Filter: (avg(pagg_tab.b) < '22'::numeric)
          ->  Append
-               ->  Foreign Scan on fpagg_tab_p1 pagg_tab_1
-               ->  Foreign Scan on fpagg_tab_p2 pagg_tab_2
-               ->  Foreign Scan on fpagg_tab_p3 pagg_tab_3
-(9 rows)
+               Async subplans: 3 
+               ->  Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+               ->  Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+               ->  Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
 
 -- Plan with partitionwise aggregates is enabled
 SET enable_partitionwise_aggregate TO true;
@@ -8767,13 +8796,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
  Sort
    Sort Key: pagg_tab.a
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 3 
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
-(9 rows)
+(10 rows)
 
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | sum  | min | count 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2175dff824..7b34afa119 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -35,6 +37,7 @@
 #include "optimizer/restrictinfo.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "postgres_fdw.h"
 #include "utils/builtins.h"
 #include "utils/float.h"
@@ -56,6 +59,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrive PgFdwScanState struct from ForeginScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -122,11 +128,28 @@ enum FdwDirectModifyPrivateIndex
     FdwDirectModifyPrivateSetProcessed
 };
 
+/*
+ * Connection common state.
+ */
+typedef struct PgFdwConnCommonState
+{
+    ForeignScanState   *leader;        /* leader node of this connection */
+    bool                busy;        /* true if this connection is busy */
+} PgFdwConnCommonState;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+    PGconn       *conn;                    /* connection for the scan */
+    PgFdwConnCommonState *commonstate;    /* connection common state */
+} PgFdwState;
+
 /*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table. NULL
                                  * for a foreign join scan. */
     TupleDesc    tupdesc;        /* tuple descriptor of scan */
@@ -137,7 +160,7 @@ typedef struct PgFdwScanState
     List       *retrieved_attrs;    /* list of retrieved attribute numbers */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
+    bool        result_ready;
     unsigned int cursor_number; /* quasi-unique ID for my cursor */
     bool        cursor_exists;    /* have we created the cursor? */
     int            numParams;        /* number of parameters passed to query */
@@ -153,6 +176,12 @@ typedef struct PgFdwScanState
     /* batch-level state, for optimizing rewinds and avoiding useless fetch */
     int            fetch_ct_2;        /* Min(# of fetches done, 2) */
     bool        eof_reached;    /* true if last fetch reached EOF */
+    bool        run_async;        /* true if run asynchronously */
+    bool        inqueue;        /* true if this node is in waiter queue */
+    ForeignScanState *waiter;    /* Next node to run a query among nodes
+                                 * sharing the same connection */
+    ForeignScanState *last_waiter;    /* last waiting node in waiting queue.
+                                     * valid only on the leader node */
 
     /* working memory contexts */
     MemoryContext batch_cxt;    /* context holding current batch of tuples */
@@ -166,11 +195,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     char       *p_name;            /* name of prepared statement, if created */
 
     /* extracted fdw_private data */
@@ -197,6 +226,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
@@ -326,6 +356,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
                                             RangeTblEntry *target_rte,
                                             Relation target_relation);
@@ -391,6 +422,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
                                          RelOptInfo *input_rel,
                                          RelOptInfo *output_rel,
                                          void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+                                              WaitEventSet *wes,
+                                              void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -419,7 +454,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
                                       EquivalenceClass *ec, EquivalenceMember *em,
                                       void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static PgFdwModifyState *create_foreign_modify(EState *estate,
                                                RangeTblEntry *rte,
@@ -522,6 +559,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     routine->IterateForeignScan = postgresIterateForeignScan;
     routine->ReScanForeignScan = postgresReScanForeignScan;
     routine->EndForeignScan = postgresEndForeignScan;
+    routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
     /* Functions for updating foreign tables */
     routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -558,6 +596,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     /* Support functions for upper relation push-down */
     routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+    /* Support functions for async execution */
+    routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+    routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
     PG_RETURN_POINTER(routine);
 }
 
@@ -1434,12 +1476,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    fsstate->conn = GetConnection(user, false);
+    fsstate->s.conn = GetConnection(user, false);
+    fsstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
+    fsstate->s.commonstate->leader = NULL;
+    fsstate->s.commonstate->busy = false;
+    fsstate->waiter = NULL;
+    fsstate->last_waiter = node;
 
     /* Assign a unique ID for my cursor */
-    fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+    fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
     fsstate->cursor_exists = false;
 
+    /* Initialize async execution status */
+    fsstate->run_async = false;
+    fsstate->inqueue = false;
+
     /* Get private info created by planner functions. */
     fsstate->query = strVal(list_nth(fsplan->fdw_private,
                                      FdwScanPrivateSelectSql));
@@ -1487,40 +1539,249 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
                              &fsstate->param_values);
 }
 
+/*
+ * Async queue manipuration functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Enqueue the node if it doesn't in the queue. Immediately starts the node if
+ * the connection is not busy.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+    PgFdwScanState   *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.commonstate->leader;
+
+    /* do nothing if the node is already in the queue or already eof'ed */
+    if (leader == node || fsstate->inqueue || fsstate->eof_reached)
+        return;
+
+    if (leader == NULL)
+    {
+        /* immediately send request if not busy */
+        request_more_data(node);
+    }
+    else
+    {
+        PgFdwScanState   *leader_state = GetPgFdwScanState(leader);
+        PgFdwScanState   *last_waiter_state
+            = GetPgFdwScanState(leader_state->last_waiter);
+
+        last_waiter_state->waiter = node;
+        leader_state->last_waiter = node;
+        fsstate->inqueue = true;
+    }
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Makes the first waiter be the next leader
+ * Returns the new leader or NULL if there's no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *ret = fsstate->waiter;
+
+    Assert(fsstate->s.commonstate->leader = node);
+
+    if (ret)
+    {
+        PgFdwScanState *retstate = GetPgFdwScanState(ret);
+        fsstate->waiter = NULL;
+        retstate->last_waiter = fsstate->last_waiter;
+        retstate->inqueue = false;
+    }
+
+    fsstate->s.commonstate->leader = ret;
+
+    return ret;
+}
+
+/*
+ * Remove the node from waiter queue.
+ *
+ * Results are cleared before removing leader if it is busy.
+ */
+static inline void
+remove_async_node(ForeignScanState *node)
+{
+    PgFdwScanState        *fsstate = GetPgFdwScanState(node);
+    ForeignScanState    *leader = fsstate->s.commonstate->leader;
+    PgFdwScanState        *leader_state;
+    ForeignScanState    *prev;
+    PgFdwScanState        *prev_state;
+    ForeignScanState    *cur;
+
+    /* no need to remove me */
+    if (!leader || !fsstate->inqueue)
+        return;
+
+    leader_state = GetPgFdwScanState(leader);
+
+    /* Remove the leader node */
+    if (leader == node)
+    {
+        ForeignScanState    *next_leader;
+
+        if (leader_state->s.commonstate->busy)
+        {
+            /*
+             * this node is waiting for result, absorb the result first so
+             * that the following commands can be sent on the connection.
+             */
+            PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+            PGconn *conn = leader_state->s.conn;
+
+            while(PQisBusy(conn))
+                PQclear(PQgetResult(conn));
+
+            leader_state->s.commonstate->busy = false;
+        }
+
+        /* Make the first waiter the leader */
+        if (leader_state->waiter)
+        {
+            PgFdwScanState *next_leader_state;
+
+            next_leader = leader_state->waiter;
+            next_leader_state = GetPgFdwScanState(next_leader);
+
+            leader_state->s.commonstate->leader = next_leader;
+            next_leader_state->last_waiter = leader_state->last_waiter;
+        }
+        leader_state->waiter = NULL;
+
+        return;
+    }
+
+    /*
+     * Just remove the node in queue
+     *
+     * This function is called on the shutdown path. We don't bother
+     * considering faster way to do this.
+     */
+    prev = leader;
+    prev_state = leader_state;
+    cur =  GetPgFdwScanState(prev)->waiter;
+    while (cur)
+    {
+        PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+        if (cur == node)
+        {
+            prev_state->waiter = curstate->waiter;
+            if (leader_state->last_waiter == cur)
+                leader_state->last_waiter = prev;
+            else
+                leader_state->last_waiter = cur;
+
+            fsstate->inqueue = false;
+
+            return;
+        }
+        prev = cur;
+        prev_state = curstate;
+        cur = curstate->waiter;
+    }
+}
+
 /*
  * postgresIterateForeignScan
- *        Retrieve next row from the result set, or clear tuple slot to indicate
- *        EOF.
+ *        Retrieve next row from the result set.
+ *
+ *        For synchronous nodes, returns clear tuple slot means EOF.
+ *
+ *        For asynchronous nodes, if clear tuple slot is returned, the caller
+ *        needs to check asyncstate to tell if all tuples received (AS_AVAILABLE)
+ *        or waiting for the next data to come (AS_WAITING).
  */
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
-    /*
-     * If this is the first call after Begin or ReScan, we need to create the
-     * cursor on the remote side.
-     */
-    if (!fsstate->cursor_exists)
-        create_cursor(node);
+    if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+    {
+        /* we've run out, get some more tuples */
+        if (!node->fs_async)
+        {
+            /* finish the running query before sending command for this node */
+            if (!fsstate->s.commonstate->busy)
+                vacate_connection((PgFdwState *)fsstate, false);
+
+            request_more_data(node);
+
+            /* Fetch the result immediately. */
+            fetch_received_data(node);
+        }
+        else if (!fsstate->s.commonstate->busy)
+        {
+            /* If the connection is not busy, just send the request. */
+            request_more_data(node);
+        }
+        else
+        {
+            /* The connection is busy */
+            bool available = true;
+            ForeignScanState *leader = fsstate->s.commonstate->leader;
+            PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+
+            /* Check if the result is immediately available */
+            if (PQisBusy(leader_state->s.conn))
+            {
+                int rc = WaitLatchOrSocket(NULL,
+                                           WL_SOCKET_READABLE | WL_TIMEOUT |
+                                           WL_EXIT_ON_PM_DEATH,
+                                           PQsocket(leader_state->s.conn), 0,
+                                           WAIT_EVENT_ASYNC_WAIT);
+                if (!(rc & WL_SOCKET_READABLE))
+                    available = false;
+            }
+
+            /* Fetch the leader's data if any */
+            if (available)
+                fetch_received_data(leader);
+
+            /* queue the requested node */
+            add_async_waiter(node);
+
+            /* queue the previous leader for the next request if needed */
+            add_async_waiter(leader);
+        }
+    }
 
-    /*
-     * Get some more tuples, if we've run out.
-     */
     if (fsstate->next_tuple >= fsstate->num_tuples)
     {
-        /* No point in another fetch if we already detected EOF, though. */
-        if (!fsstate->eof_reached)
-            fetch_more_data(node);
-        /* If we didn't get any tuples, must be end of data. */
-        if (fsstate->next_tuple >= fsstate->num_tuples)
-            return ExecClearTuple(slot);
+        /*
+         * We haven't received a result for the given node this time, return
+         * with no tuple to give way to another node.
+         */
+        if (fsstate->eof_reached)
+        {
+            fsstate->result_ready = true;
+            node->ss.ps.asyncstate = AS_AVAILABLE;
+        }
+        else
+        {
+            fsstate->result_ready = false;
+            node->ss.ps.asyncstate = AS_WAITING;
+        }
+
+        return ExecClearTuple(slot);
     }
 
     /*
      * Return the next tuple.
      */
+    fsstate->result_ready = true;
+    node->ss.ps.asyncstate = AS_AVAILABLE;
     ExecStoreHeapTuple(fsstate->tuples[fsstate->next_tuple++],
                        slot,
                        false);
@@ -1535,7 +1796,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     char        sql[64];
     PGresult   *res;
 
@@ -1543,6 +1804,8 @@ postgresReScanForeignScan(ForeignScanState *node)
     if (!fsstate->cursor_exists)
         return;
 
+    vacate_connection((PgFdwState *)fsstate, true);
+
     /*
      * If any internal parameters affecting this node have changed, we'd
      * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1571,9 +1834,9 @@ postgresReScanForeignScan(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_exec_query(fsstate->conn, sql);
+    res = pgfdw_exec_query(fsstate->s.conn, sql);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+        pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
     PQclear(res);
 
     /* Now force a fresh FETCH. */
@@ -1591,7 +1854,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
     /* if fsstate is NULL, we are in EXPLAIN; nothing to do */
     if (fsstate == NULL)
@@ -1599,15 +1862,31 @@ postgresEndForeignScan(ForeignScanState *node)
 
     /* Close the cursor if open, to prevent accumulation of cursors */
     if (fsstate->cursor_exists)
-        close_cursor(fsstate->conn, fsstate->cursor_number);
+        close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
     /* Release remote connection */
-    ReleaseConnection(fsstate->conn);
-    fsstate->conn = NULL;
+    ReleaseConnection(fsstate->s.conn);
+    fsstate->s.conn = NULL;
 
     /* MemoryContexts will be deleted automatically. */
 }
 
+/*
+ * postgresShutdownForeignScan
+ *        Remove asynchrony stuff and cleanup garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+    ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+    if (plan->operation != CMD_SELECT)
+        return;
+
+    /* remove the node from waiting queue */
+    remove_async_node(node);
+}
+
 /*
  * postgresAddForeignUpdateTargets
  *        Add resjunk column(s) needed for update/delete on a foreign table
@@ -2372,7 +2651,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    dmstate->conn = GetConnection(user, false);
+    dmstate->s.conn = GetConnection(user, false);
+    dmstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
 
     /* Update the foreign-join-related fields. */
     if (fsplan->scan.scanrelid == 0)
@@ -2457,7 +2738,11 @@ postgresIterateDirectModify(ForeignScanState *node)
      * If this is the first call after Begin, execute the statement.
      */
     if (dmstate->num_tuples == -1)
+    {
+        /* finish running query to send my command */
+        vacate_connection((PgFdwState *)dmstate, true);
         execute_dml_stmt(node);
+    }
 
     /*
      * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2504,8 +2789,8 @@ postgresEndDirectModify(ForeignScanState *node)
         PQclear(dmstate->result);
 
     /* Release remote connection */
-    ReleaseConnection(dmstate->conn);
-    dmstate->conn = NULL;
+    ReleaseConnection(dmstate->s.conn);
+    dmstate->s.conn = NULL;
 
     /* MemoryContext will be deleted automatically. */
 }
@@ -2703,6 +2988,7 @@ estimate_path_cost_size(PlannerInfo *root,
         List       *local_param_join_conds;
         StringInfoData sql;
         PGconn       *conn;
+        PgFdwConnCommonState *commonstate;
         Selectivity local_sel;
         QualCost    local_cost;
         List       *fdw_scan_tlist = NIL;
@@ -2747,6 +3033,18 @@ estimate_path_cost_size(PlannerInfo *root,
 
         /* Get the remote estimate */
         conn = GetConnection(fpinfo->user, false);
+        commonstate = GetConnectionSpecificStorage(fpinfo->user,
+                                                sizeof(PgFdwConnCommonState));
+        if (commonstate)
+        {
+            PgFdwState tmpstate;
+            tmpstate.conn = conn;
+            tmpstate.commonstate = commonstate;
+
+            /* finish running query to send my command */
+            vacate_connection(&tmpstate, true);
+        }
+
         get_remote_estimate(sql.data, conn, &rows, &width,
                             &startup_cost, &total_cost);
         ReleaseConnection(conn);
@@ -3317,11 +3615,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     ExprContext *econtext = node->ss.ps.ps_ExprContext;
     int            numParams = fsstate->numParams;
     const char **values = fsstate->param_values;
-    PGconn       *conn = fsstate->conn;
+    PGconn       *conn = fsstate->s.conn;
     StringInfoData buf;
     PGresult   *res;
 
@@ -3384,50 +3682,127 @@ create_cursor(ForeignScanState *node)
 }
 
 /*
- * Fetch some more rows from the node's cursor.
+ * Sends the next request of the node. If the given node is different from the
+ * current connection leader, pushes it back to waiter queue and let the given
+ * node be the leader.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.commonstate->leader;
+    PGconn       *conn = fsstate->s.conn;
+    char        sql[64];
+
+    /* must be non-busy */
+    Assert(!fsstate->s.commonstate->busy);
+    /* must be not-eof */
+    Assert(!fsstate->eof_reached);
+
+    /*
+     * If this is the first call after Begin or ReScan, we need to create the
+     * cursor on the remote side.
+     */
+    if (!fsstate->cursor_exists)
+        create_cursor(node);
+
+    snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+             fsstate->fetch_size, fsstate->cursor_number);
+
+    if (!PQsendQuery(conn, sql))
+        pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+    fsstate->s.commonstate->busy = true;
+
+    /* Let the node be the leader if it is different from current one */
+    if (leader != node)
+    {
+        /*
+         * If the connection leader exists, insert the node as the connection
+         * leader making the current leader be the first waiter.
+         */
+        if (leader != NULL)
+        {
+            remove_async_node(node);
+            fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+            fsstate->waiter = leader;
+        }
+        else
+        {
+            fsstate->last_waiter = node;
+            fsstate->waiter = NULL;
+        }
+
+        fsstate->s.commonstate->leader = node;
+    }
+}
+
+/*
+ * Fetches received data and automatically send requests of the next waiter.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     PGresult   *volatile res = NULL;
     MemoryContext oldcontext;
+    ForeignScanState *waiter;
+
+    /* I should be the current connection leader */
+    Assert(fsstate->s.commonstate->leader == node);
 
     /*
      * We'll store the tuples in the batch_cxt.  First, flush the previous
-     * batch.
+     * batch if no tuple is remaining
      */
-    fsstate->tuples = NULL;
-    MemoryContextReset(fsstate->batch_cxt);
+    if (fsstate->next_tuple >= fsstate->num_tuples)
+    {
+        fsstate->tuples = NULL;
+        fsstate->num_tuples = 0;
+        MemoryContextReset(fsstate->batch_cxt);
+    }
+    else if (fsstate->next_tuple > 0)
+    {
+        /* move the remaining tuples to the beginning of the store */
+        int n = 0;
+
+        while(fsstate->next_tuple < fsstate->num_tuples)
+            fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+        fsstate->num_tuples = n;
+    }
+
     oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
     /* PGresult must be released before leaving this function. */
     PG_TRY();
     {
-        PGconn       *conn = fsstate->conn;
+        PGconn       *conn = fsstate->s.conn;
         char        sql[64];
-        int            numrows;
+        int            addrows;
+        size_t        newsize;
         int            i;
 
         snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
                  fsstate->fetch_size, fsstate->cursor_number);
 
-        res = pgfdw_exec_query(conn, sql);
+        res = pgfdw_get_result(conn, sql);
         /* On error, report the original query, not the FETCH. */
         if (PQresultStatus(res) != PGRES_TUPLES_OK)
             pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
         /* Convert the data into HeapTuples */
-        numrows = PQntuples(res);
-        fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-        fsstate->num_tuples = numrows;
-        fsstate->next_tuple = 0;
+        addrows = PQntuples(res);
+        newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+        if (fsstate->tuples)
+            fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+        else
+            fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-        for (i = 0; i < numrows; i++)
+        for (i = 0; i < addrows; i++)
         {
             Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-            fsstate->tuples[i] =
+            fsstate->tuples[fsstate->num_tuples + i] =
                 make_tuple_from_result_row(res, i,
                                            fsstate->rel,
                                            fsstate->attinmeta,
@@ -3437,22 +3812,75 @@ fetch_more_data(ForeignScanState *node)
         }
 
         /* Update fetch_ct_2 */
-        if (fsstate->fetch_ct_2 < 2)
+        if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
             fsstate->fetch_ct_2++;
 
+        fsstate->next_tuple = 0;
+        fsstate->num_tuples += addrows;
+
         /* Must be EOF if we didn't get as many tuples as we asked for. */
-        fsstate->eof_reached = (numrows < fsstate->fetch_size);
+        fsstate->eof_reached = (addrows < fsstate->fetch_size);
+
+        PQclear(res);
+        res = NULL;
     }
     PG_FINALLY();
     {
+        fsstate->s.commonstate->busy = false;
+
         if (res)
             PQclear(res);
     }
     PG_END_TRY();
 
+    fsstate->s.commonstate->busy = false;
+
+    /* let the first waiter be the next leader of this connection */
+    waiter = move_to_next_waiter(node);
+
+    /* send the next request if any */
+    if (waiter)
+        request_more_data(waiter);
+
     MemoryContextSwitchTo(oldcontext);
 }
 
+/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+    PgFdwConnCommonState *commonstate = fdwstate->commonstate;
+    ForeignScanState *leader;
+
+    /* the connection is alrady available */
+    if (commonstate == NULL || commonstate->leader == NULL || !commonstate->busy)
+        return;
+
+    /*
+     * let the current connection leader read the result for the running query
+     */
+    leader = commonstate->leader;
+    fetch_received_data(leader);
+
+    /* let the first waiter be the next leader of this connection */
+    move_to_next_waiter(leader);
+
+    if (!clear_queue)
+        return;
+
+    /* Clear the waiting list */
+    while (leader)
+    {
+        PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+        fsstate->last_waiter = NULL;
+        leader = fsstate->waiter;
+        fsstate->waiter = NULL;
+    }
+}
+
 /*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
@@ -3566,7 +3994,9 @@ create_foreign_modify(EState *estate,
     user = GetUserMapping(userid, table->serverid);
 
     /* Open connection; report that we'll create a prepared statement. */
-    fmstate->conn = GetConnection(user, true);
+    fmstate->s.conn = GetConnection(user, true);
+    fmstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
     fmstate->p_name = NULL;        /* prepared statement not made yet */
 
     /* Set up remote query information. */
@@ -3653,6 +4083,9 @@ execute_foreign_modify(EState *estate,
            operation == CMD_UPDATE ||
            operation == CMD_DELETE);
 
+    /* finish running query to send my command */
+    vacate_connection((PgFdwState *)fmstate, true);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -3680,14 +4113,14 @@ execute_foreign_modify(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3695,10 +4128,10 @@ execute_foreign_modify(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -3734,7 +4167,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
     /* Construct name we'll use for the prepared statement. */
     snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-             GetPrepStmtNumber(fmstate->conn));
+             GetPrepStmtNumber(fmstate->s.conn));
     p_name = pstrdup(prep_name);
 
     /*
@@ -3744,12 +4177,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * the prepared statements we use in this module are simple enough that
      * the remote server will make the right choices.
      */
-    if (!PQsendPrepare(fmstate->conn,
+    if (!PQsendPrepare(fmstate->s.conn,
                        p_name,
                        fmstate->query,
                        0,
                        NULL))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3757,9 +4190,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
     PQclear(res);
 
     /* This action shows that the prepare has been done. */
@@ -3888,16 +4321,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
          * We don't use a PG_TRY block here, so be careful not to throw error
          * without releasing the PGresult.
          */
-        res = pgfdw_exec_query(fmstate->conn, sql);
+        res = pgfdw_exec_query(fmstate->s.conn, sql);
         if (PQresultStatus(res) != PGRES_COMMAND_OK)
-            pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+            pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
         PQclear(res);
         fmstate->p_name = NULL;
     }
 
     /* Release remote connection */
-    ReleaseConnection(fmstate->conn);
-    fmstate->conn = NULL;
+    ReleaseConnection(fmstate->s.conn);
+    fmstate->s.conn = NULL;
 }
 
 /*
@@ -4056,9 +4489,9 @@ execute_dml_stmt(ForeignScanState *node)
      * the desired result.  This allows us to avoid assuming that the remote
      * server has the same OIDs we do for the parameters' types.
      */
-    if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+    if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
                            NULL, values, NULL, NULL, 0))
-        pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+        pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -4066,10 +4499,10 @@ execute_dml_stmt(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+    dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
     if (PQresultStatus(dmstate->result) !=
         (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+        pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
                            dmstate->query);
 
     /* Get the number of rows affected. */
@@ -5560,6 +5993,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
     /* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+    return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add wait event so that the ForeignScan node is going to wait for.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                                  void *caller_data, bool reinit)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+    /* Reinit is not supported for now. */
+    Assert(reinit);
+
+    if (fsstate->s.commonstate->leader == node)
+    {
+        AddWaitEventToSet(wes,
+                          WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+                          NULL, caller_data);
+        return true;
+    }
+
+    return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..96af75a33e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -85,6 +85,7 @@ typedef struct PgFdwRelationInfo
     UserMapping *user;            /* only set in use_remote_estimate mode */
 
     int            fetch_size;        /* fetch size for this remote table */
+    bool        allow_prefetch;    /* true to allow overlapped fetching  */
 
     /*
      * Name of the relation, for use while EXPLAINing ForeignScan.  It is used
@@ -130,6 +131,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 83971665e3..359208a12a 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1780,25 +1780,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1840,12 +1840,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1904,8 +1904,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.18.2


Re: Asynchronous Append on postgres_fdw nodes.

From
David Steele
Date:
On 2/28/20 3:06 AM, Kyotaro Horiguchi wrote:
> Hello, this is a follow-on of [1] and [2].
> 
> Currently the executor visits execution nodes one-by-one.  Considering
> sharding, Append on multiple postgres_fdw nodes can work
> simultaneously and that can largely shorten the respons of the whole
> query.  For example, aggregations that can be pushed-down to remote
> would be accelerated by the number of remote servers. Even other than
> such an extreme case, collecting tuples from multiple servers also can
> be accelerated by tens of percent [2].
> 
> I have suspended the work waiting asyncrohous or push-up executor to
> come but the mood seems inclining toward doing that before that to
> come [3].
> 
> The patchset consists of three parts.

Are these improvements targeted at PG13 or PG14?  This seems to be a 
pretty big change for the last CF of PG13.

Regards,
-- 
-David
david@pgmasters.net



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
At Wed, 4 Mar 2020 09:56:55 -0500, David Steele <david@pgmasters.net> wrote in 
> On 2/28/20 3:06 AM, Kyotaro Horiguchi wrote:
> > Hello, this is a follow-on of [1] and [2].
> > Currently the executor visits execution nodes one-by-one.  Considering
> > sharding, Append on multiple postgres_fdw nodes can work
> > simultaneously and that can largely shorten the respons of the whole
> > query.  For example, aggregations that can be pushed-down to remote
> > would be accelerated by the number of remote servers. Even other than
> > such an extreme case, collecting tuples from multiple servers also can
> > be accelerated by tens of percent [2].
> > I have suspended the work waiting asyncrohous or push-up executor to
> > come but the mood seems inclining toward doing that before that to
> > come [3].
> > The patchset consists of three parts.
> 
> Are these improvements targeted at PG13 or PG14?  This seems to be a
> pretty big change for the last CF of PG13.

It is targeted at PG14.  As we have the target version in CF-app now,
I marked it as targetting PG14.

Thank you for the suggestion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
Thomas Munro
Date:
On Fri, Feb 28, 2020 at 9:08 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> - v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
>   The async feature uses WaitEvent, and it needs to be released on
>   error.  This patch makes it possible to register WaitEvent to
>   resowner to handle that case..

+1

> - v2-0002-infrastructure-for-asynchronous-execution.patch
>   It povides an abstraction layer of asynchronous behavior
>   (execAsync). Then adds ExecAppend, another version of ExecAppend,
>   that handles "async-capable" subnodes asynchronously. Also it
>   contains planner part that makes planner aware of "async-capable"
>   and "async-aware" path nodes.

> This patch add an infrastructure for asynchronous execution. As a PoC
> this makes only Append capable to handle asynchronously executable
> subnodes.

What other nodes do you think could be async aware?  I suppose you
would teach joins to pass through the async support of their children,
and then you could make partition-wise join work like that.

+    /* choose appropriate version of Exec function */
+    if (appendstate->as_nasyncplans == 0)
+        appendstate->ps.ExecProcNode = ExecAppend;
+    else
+        appendstate->ps.ExecProcNode = ExecAppendAsync;

Cool.  No extra cost for people not using the new feature.

+        slot = ExecProcNode(subnode);
+        if (subnode->asyncstate == AS_AVAILABLE)

So, now when you execute a node, you get a result AND you get some
information that you access by reaching into the child node's
PlanState.  The ExecProcNode() interface is extremely limiting, but
I'm not sure if this is the right way to extend it.  Maybe
ExecAsyncProcNode() with a wide enough interface to do the job?

+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+    static int *refind = NULL;
+    static int refindsize = 0;
...
+    if (refindsize < n)
...
+            static ExecAsync_mcbarg mcb_arg =
+                { &refind, &refindsize };
+            static MemoryContextCallback mcb =
+                { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
...
+            MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);

This seems a bit strange.  Why not just put the pointer in the plan
state?  I suppose you want to avoid allocating a new buffer for every
query.  Perhaps you could fix that by having a small fixed-size buffer
in the PlanState to cover common cases and allocating a larger one in
a per-query memory context if that one is too small, or just not
worrying about it and allocating every time since you know the desired
size.

+    wes = CreateWaitEventSet(TopTransactionContext,
TopTransactionResourceOwner, n);
...
+    FreeWaitEventSet(wes);

BTW, just as an FYI, I am proposing[1] to add support for
RemoveWaitEvent(), so that you could have a single WaitEventSet for
the lifetime of the executor node, and then add and remove sockets
only as needed.  I'm hoping to commit that for PG13, if there are no
objections or better ideas soon, because it's useful for some other
places where we currently create and destroy WaitEventSets frequently.
One complication when working with long-lived WaitEventSet objects is
that libpq (or some other thing used by some other hypothetical
async-capable FDW) is free to close and reopen its socket whenever it
wants, so you need a way to know when it has done that.  In that patch
set I added pqSocketChangeCount() so that you can see when pgSocket()
refers to a new socket (even if the file descriptor number is the same
by coincidence), but that imposes some book-keeping duties on the
caller, and now I'm wondering how that would look in your patch set.
My goal is to generate the minimum number of systems calls.  I think
it would be nice if a 1000-shard query only calls epoll_ctl() when a
child node needs to be added or removed from the set, not
epoll_create(), 1000 * epoll_ctl(), epoll_wait(), close()  for every
wait.  But I suppose there is an argument that it's more complication
than it's worth.

[1] https://commitfest.postgresql.org/27/2452/



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
Thank you for the comment.

At Thu, 5 Mar 2020 16:17:24 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in 
> On Fri, Feb 28, 2020 at 9:08 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > - v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
> >   The async feature uses WaitEvent, and it needs to be released on
> >   error.  This patch makes it possible to register WaitEvent to
> >   resowner to handle that case..
> 
> +1
> 
> > - v2-0002-infrastructure-for-asynchronous-execution.patch
> >   It povides an abstraction layer of asynchronous behavior
> >   (execAsync). Then adds ExecAppend, another version of ExecAppend,
> >   that handles "async-capable" subnodes asynchronously. Also it
> >   contains planner part that makes planner aware of "async-capable"
> >   and "async-aware" path nodes.
> 
> > This patch add an infrastructure for asynchronous execution. As a PoC
> > this makes only Append capable to handle asynchronously executable
> > subnodes.
> 
> What other nodes do you think could be async aware?  I suppose you
> would teach joins to pass through the async support of their children,
> and then you could make partition-wise join work like that.

An Append node is fed from many immediate-child async-capable nodes,
so the Apeend node can pick any child node that have fired.

Unfortunately joins are not wide but deep. That means a set of
async-capable nodes have multiple async-aware (and async capable at
the same time for intermediate nodes) parent nodes. So if we want to
be async on that configuration, we need "push-up" executor engine. In
my last trial, ignoring performane, I could turn almost all nodes into
push-up style but a few nodes, like WindowAgg or HashJoin have a quite
low affinity with push-up style since the caller sites to child nodes
are many and scattered.  I got through the low-affinity by turning the
nodes into state machines, but I don't think it is good.

> +    /* choose appropriate version of Exec function */
> +    if (appendstate->as_nasyncplans == 0)
> +        appendstate->ps.ExecProcNode = ExecAppend;
> +    else
> +        appendstate->ps.ExecProcNode = ExecAppendAsync;
> 
> Cool.  No extra cost for people not using the new feature.

It creates some duplicate code but I agree on the performance
perspective.

> +        slot = ExecProcNode(subnode);
> +        if (subnode->asyncstate == AS_AVAILABLE)
> 
> So, now when you execute a node, you get a result AND you get some
> information that you access by reaching into the child node's
> PlanState.  The ExecProcNode() interface is extremely limiting, but
> I'm not sure if this is the right way to extend it.  Maybe
> ExecAsyncProcNode() with a wide enough interface to do the job?

Sounds resonable and seems easy to do.

> +Bitmapset *
> +ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
> +{
> +    static int *refind = NULL;
> +    static int refindsize = 0;
> ...
> +    if (refindsize < n)
> ...
> +            static ExecAsync_mcbarg mcb_arg =
> +                { &refind, &refindsize };
> +            static MemoryContextCallback mcb =
> +                { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
> ...
> +            MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
> 
> This seems a bit strange.  Why not just put the pointer in the plan
> state?  I suppose you want to avoid allocating a new buffer for every
> query.  Perhaps you could fix that by having a small fixed-size buffer
> in the PlanState to cover common cases and allocating a larger one in
> a per-query memory context if that one is too small, or just not
> worrying about it and allocating every time since you know the desired
> size.

The most significant factor for the shape would be ExecAsync is not a
kind of ExecNode.  So ExecAsyncEventWait doen't have direcgt access to
EState other than one of given mutiple nodes. I consider tryig to use
given ExecNodes as an access path to ESttate.


> +    wes = CreateWaitEventSet(TopTransactionContext,
> TopTransactionResourceOwner, n);
> ...
> +    FreeWaitEventSet(wes);
> 
> BTW, just as an FYI, I am proposing[1] to add support for
> RemoveWaitEvent(), so that you could have a single WaitEventSet for
> the lifetime of the executor node, and then add and remove sockets
> only as needed.  I'm hoping to commit that for PG13, if there are no
> objections or better ideas soon, because it's useful for some other
> places where we currently create and destroy WaitEventSets frequently.

Yes! I have wanted that (but haven't done by myself..., and I didn't
understand the details from the title "Reducint WaitEventSet syscall
churn":p)

> One complication when working with long-lived WaitEventSet objects is
> that libpq (or some other thing used by some other hypothetical
> async-capable FDW) is free to close and reopen its socket whenever it
> wants, so you need a way to know when it has done that.  In that patch
> set I added pqSocketChangeCount() so that you can see when pgSocket()
> refers to a new socket (even if the file descriptor number is the same
> by coincidence), but that imposes some book-keeping duties on the
> caller, and now I'm wondering how that would look in your patch set.

As for postgres-fdw, unsponaneous disconnection immedately leands to
query ERROR.

> My goal is to generate the minimum number of systems calls.  I think
> it would be nice if a 1000-shard query only calls epoll_ctl() when a
> child node needs to be added or removed from the set, not
> epoll_create(), 1000 * epoll_ctl(), epoll_wait(), close()  for every
> wait.  But I suppose there is an argument that it's more complication
> than it's worth.
> 
> [1] https://commitfest.postgresql.org/27/2452/

I'm not sure how it gives performance gain, but reducing syscalls
itself is good. I'll look on it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
movead li
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

I have tested the feature and it shows great performance in queries
which have small amount result compared with base scan amount.

Re: Asynchronous Append on postgres_fdw nodes.

From
movead li
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

I occur a strange issue when a exec 'make installcheck-world', it is:

##########################################################
...
============== running regression test queries        ==============
test adminpack                    ... FAILED       60 ms

======================
 1 of 1 tests failed. 
======================

The differences that caused some tests to fail can be viewed in the
file "/work/src/postgres_app_for/contrib/adminpack/regression.diffs".  A copy of the test summary that you see
above is saved in the file "/work/src/postgres_app_for/contrib/adminpack/regression.out".
...
##########################################################

And the content in 'contrib/adminpack/regression.out' is:
##########################################################
SELECT pg_file_write('/tmp/test_file0', 'test0', false);
 ERROR:  absolute path not allowed
 SELECT pg_file_write(current_setting('data_directory') || '/test_file4', 'test4', false);
- pg_file_write 
----------------
-             5
-(1 row)
-
+ERROR:  reference to parent directory ("..") not allowed
 SELECT pg_file_write(current_setting('data_directory') || '/../test_file4', 'test4', false);
 ERROR:  reference to parent directory ("..") not allowed
 RESET ROLE;
@@ -149,7 +145,7 @@
 SELECT pg_file_unlink('test_file4');
  pg_file_unlink 
 ----------------
- t
+ f
 (1 row)
##########################################################

However the issue does not occur when I do a 'make check-world'.
And it doesn't occur when I test the 'make installcheck-world' without the patch.

The new status of this patch is: Waiting on Author

Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
Hello. Thank you for testing.

At Tue, 10 Mar 2020 05:13:42 +0000, movead li <movead.li@highgo.ca> wrote in 
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, failed
> Implements feature:       tested, passed
> Spec compliant:           tested, passed
> Documentation:            not tested
> 
> I occur a strange issue when a exec 'make installcheck-world', it is:

I had't done that.. Bud it worked for me.

> ##########################################################
> ...
> ============== running regression test queries        ==============
> test adminpack                    ... FAILED       60 ms
> 
> ======================
>  1 of 1 tests failed. 
> ======================
> 
> The differences that caused some tests to fail can be viewed in the
> file "/work/src/postgres_app_for/contrib/adminpack/regression.diffs".  A copy of the test summary that you see
> above is saved in the file "/work/src/postgres_app_for/contrib/adminpack/regression.out".
> ...
> ##########################################################
> 
> And the content in 'contrib/adminpack/regression.out' is:

I don't see that file. Maybe regression.diff?

> ##########################################################
> SELECT pg_file_write('/tmp/test_file0', 'test0', false);
>  ERROR:  absolute path not allowed
>  SELECT pg_file_write(current_setting('data_directory') || '/test_file4', 'test4', false);
> - pg_file_write 
> ----------------
> -             5
> -(1 row)
> -
> +ERROR:  reference to parent directory ("..") not allowed

It seems to me that you are setting a path containing ".." to PGDATA.

> However the issue does not occur when I do a 'make check-world'.
> And it doesn't occur when I test the 'make installcheck-world' without the patch.

check-world doesn't use path containing ".." as PGDATA.

> The new status of this patch is: Waiting on Author

Thanks for noticing that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Re: Asynchronous Append on postgres_fdw nodes.

From
"movead.li@highgo.ca"
Date:

>It seems to me that you are setting a path containing ".." to PGDATA.
Thanks point it for me.


Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

Re: Asynchronous Append on postgres_fdw nodes.

From
movead li
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

I redo the make installcheck-world as Kyotaro Horiguchi point out and the
result nothing wrong. And I think the patch is good in feature and performance
here is the test result thread I made before:
https://www.postgresql.org/message-id/CA%2B9bhCK7chd0qx%2Bmny%2BU9xaOs2FDNJ7RaxG4%3D9rpgT6oAKBgWA%40mail.gmail.com

The new status of this patch is: Ready for Committer

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
Hi,

On Wed, Mar 11, 2020 at 10:47 AM movead li <movead.li@highgo.ca> wrote:

> I redo the make installcheck-world as Kyotaro Horiguchi point out and the
> result nothing wrong. And I think the patch is good in feature and performance
> here is the test result thread I made before:
> https://www.postgresql.org/message-id/CA%2B9bhCK7chd0qx%2Bmny%2BU9xaOs2FDNJ7RaxG4%3D9rpgT6oAKBgWA%40mail.gmail.com
>
> The new status of this patch is: Ready for Committer

As discussed upthread, this is a material for PG14, so I moved this to
the next commitfest, keeping the same status.  I've not looked at the
patch in any detail yet, so I'm not sure that that is the right status
for the patch, though.  I'd like to work on this for PG14 if I have
time.

Thanks!

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Andrey Lepikhov
Date:
On 3/30/20 1:15 PM, Etsuro Fujita wrote:
> Hi,
> 
> On Wed, Mar 11, 2020 at 10:47 AM movead li <movead.li@highgo.ca> wrote:
> 
>> I redo the make installcheck-world as Kyotaro Horiguchi point out and the
>> result nothing wrong. And I think the patch is good in feature and performance
>> here is the test result thread I made before:
>> https://www.postgresql.org/message-id/CA%2B9bhCK7chd0qx%2Bmny%2BU9xaOs2FDNJ7RaxG4%3D9rpgT6oAKBgWA%40mail.gmail.com
>>
>> The new status of this patch is: Ready for Committer
> 
> As discussed upthread, this is a material for PG14, so I moved this to
> the next commitfest, keeping the same status.  I've not looked at the
> patch in any detail yet, so I'm not sure that that is the right status
> for the patch, though.  I'd like to work on this for PG14 if I have
> time.

Hi,
This patch no longer applies cleanly.
In addition, code comments contain spelling errors.

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
Hello, Andrey.

At Wed, 3 Jun 2020 15:00:06 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in 
> This patch no longer applies cleanly.
> In addition, code comments contain spelling errors.

Sure. Thaks for noticing of them and sorry for the many typos.
Additional item in WaitEventIPC conflicted with this.


I found the following typos.

connection.c:
  s/Rerturns/Returns/
postgres-fdw.c:
  s/Retrive/Retrieve/
  s/ForeginScanState/ForeignScanState/
  s/manipuration/manipulation/
  s/asyncstate/async state/
  s/alrady/already/

nodeAppend.c:  
  s/Rery/Retry/

createplan.c:
  s/chidlren/children/

resowner.c:
  s/identier/identifier/ X 2

execnodes.h:
  s/sutff/stuff/

plannodes.h:
  s/asyncronous/asynchronous/
  
  
Removed a useless variable PgFdwScanState.result_ready.
Removed duplicate code from remove_async_node() by using move_to_next_waiter().
Done some minor cleanups.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From db231fa99da5954b52e195f6af800c0f9b991ed4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH v3 1/3] Allow wait event set to be registered to resource
 owner

WaitEventSet needs to be released using resource owner for a certain
case. This change adds WaitEventSet reowner and allow the creator of a
WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 67 +++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 +++
 6 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 7717bb2719..16aefb03ee 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -218,7 +218,7 @@ pq_init(void)
                 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
     AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
                       NULL, NULL);
     AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 05df5017c4..a8b52cd381 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -56,6 +56,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -84,6 +85,8 @@ struct WaitEventSet
     int            nevents;        /* number of registered events */
     int            nevents_space;    /* maximum number of events in this set */
 
+    ResourceOwner    resowner;    /* Resource owner */
+
     /*
      * Array, of nevents_space length, storing the definition of events this
      * set is waiting for.
@@ -393,7 +396,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
     int            ret = 0;
     int            rc;
     WaitEvent    event;
-    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
     if (wakeEvents & WL_TIMEOUT)
         Assert(timeout >= 0);
@@ -560,12 +563,15 @@ ResetLatch(Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
     WaitEventSet *set;
     char       *data;
     Size        sz = 0;
 
+    if (res)
+        ResourceOwnerEnlargeWESs(res);
+
     /*
      * Use MAXALIGN size/alignment to guarantee that later uses of memory are
      * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -680,6 +686,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
     StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+    /* Register this wait event set if requested */
+    set->resowner = res;
+    if (res)
+        ResourceOwnerRememberWES(set->resowner, set);
+
     return set;
 }
 
@@ -725,6 +736,9 @@ FreeWaitEventSet(WaitEventSet *set)
     }
 #endif
 
+    if (set->resowner != NULL)
+        ResourceOwnerForgetWES(set->resowner, set);
+
     pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 37b6a4eecd..fcc92138fe 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -70,7 +70,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     {
         WaitEventSet *new_event_set;
 
-        new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+        new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
         AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
                           MyLatch, NULL);
         AddWaitEventToSet(new_event_set, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 8bc2c4e9ea..237ca9fa30 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -128,6 +128,7 @@ typedef struct ResourceOwnerData
     ResourceArray filearr;        /* open temporary files */
     ResourceArray dsmarr;        /* dynamic shmem segments */
     ResourceArray jitarr;        /* JIT contexts */
+    ResourceArray wesarr;        /* wait event sets */
 
     /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
     int            nlocks;            /* number of owned locks */
@@ -175,6 +176,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -444,6 +446,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
     ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
     ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
     ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+    ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
     return owner;
 }
@@ -553,6 +556,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 
             jit_release_context(context);
         }
+
+        /* Ditto for wait event sets */
+        while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+        {
+            WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+            if (isCommit)
+                PrintWESLeakWarning(event);
+            FreeWaitEventSet(event);
+        }
     }
     else if (phase == RESOURCE_RELEASE_LOCKS)
     {
@@ -725,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     Assert(owner->filearr.nitems == 0);
     Assert(owner->dsmarr.nitems == 0);
     Assert(owner->jitarr.nitems == 0);
+    Assert(owner->wesarr.nitems == 0);
     Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
     /*
@@ -752,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     ResourceArrayFree(&(owner->filearr));
     ResourceArrayFree(&(owner->dsmarr));
     ResourceArrayFree(&(owner->jitarr));
+    ResourceArrayFree(&(owner->wesarr));
 
     pfree(owner);
 }
@@ -1370,3 +1385,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
         elog(ERROR, "JIT context %p is not owned by resource owner %s",
              DatumGetPointer(handle), owner->name);
 }
+
+/*
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+    ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+    ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identier of a wait event set,
+     * use its pointer instead.
+     */
+    if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+        elog(ERROR, "wait event set %p is not owned by resource owner %s",
+             events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identier of a wait event set,
+     * use its pointer instead.
+     */
+    elog(WARNING, "wait event set leak: %p still referenced",
+         events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 46ae56cae3..b1b8375768 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -163,7 +164,8 @@ extern void DisownLatch(Latch *latch);
 extern void SetLatch(Latch *latch);
 extern void ResetLatch(Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+                                        ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int    AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
                               Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..7d19dadd57 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
 extern void ResourceOwnerForgetJIT(ResourceOwner owner,
                                    Datum handle);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+                         WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+                       WaitEventSet *);
+
 #endif                            /* RESOWNER_PRIVATE_H */
-- 
2.18.2

From ced3307f27f01e657499ae6ef4436efaa5e350e5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 15 May 2018 20:21:32 +0900
Subject: [PATCH v3 2/3] infrastructure for asynchronous execution

This patch add an infrastructure for asynchronous execution. As a PoC
this makes only Append capable to handle asynchronously executable
subnodes.
---
 src/backend/commands/explain.c          |  17 ++
 src/backend/executor/Makefile           |   1 +
 src/backend/executor/execAsync.c        | 152 +++++++++++
 src/backend/executor/nodeAppend.c       | 342 ++++++++++++++++++++----
 src/backend/executor/nodeForeignscan.c  |  21 ++
 src/backend/nodes/bitmapset.c           |  72 +++++
 src/backend/nodes/copyfuncs.c           |   3 +
 src/backend/nodes/outfuncs.c            |   3 +
 src/backend/nodes/readfuncs.c           |   3 +
 src/backend/optimizer/plan/createplan.c |  66 ++++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/postmaster/syslogger.c      |   2 +-
 src/backend/utils/adt/ruleutils.c       |   8 +-
 src/backend/utils/resowner/resowner.c   |   4 +-
 src/include/executor/execAsync.h        |  22 ++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 +
 src/include/nodes/bitmapset.h           |   1 +
 src/include/nodes/execnodes.h           |  23 +-
 src/include/nodes/plannodes.h           |   9 +
 src/include/pgstat.h                    |   3 +-
 22 files changed, 705 insertions(+), 65 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index efd7201d61..708e9ed546 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -86,6 +86,7 @@ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
                                        List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                                    ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
                           ExplainState *es);
 static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1389,6 +1390,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
         }
         if (plan->parallel_aware)
             appendStringInfoString(es->str, "Parallel ");
+        if (plan->async_capable)
+            appendStringInfoString(es->str, "Async ");
         appendStringInfoString(es->str, pname);
         es->indent++;
     }
@@ -1969,6 +1972,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
         case T_Hash:
             show_hash_info(castNode(HashState, planstate), es);
             break;
+
+        case T_Append:
+            show_append_info(castNode(AppendState, planstate), es);
+            break;
+
         default:
             break;
     }
@@ -2322,6 +2330,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                          ancestors, es);
 }
 
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+    Append *plan = (Append *) astate->ps.plan;
+
+    if (plan->nasyncplans > 0)
+        ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
 /*
  * Show the grouping keys for an Agg node.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
     execAmi.o \
+    execAsync.o \
     execCurrent.o \
     execExpr.o \
     execExprInterp.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..2b7d1877e0
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,152 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *      Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *      src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+/*
+ * ExecAsyncConfigureWait: Add wait event to the WaitEventSet if needed.
+ *
+ * If reinit is true, the caller didn't reuse existing WaitEventSet.
+ */
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                       void *data, bool reinit)
+{
+    switch (nodeTag(node))
+    {
+    case T_ForeignScanState:
+        return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+                                             wes, data, reinit);
+        break;
+    default:
+            elog(ERROR, "unrecognized node type: %d",
+                (int) nodeTag(node));
+    }
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+    int **p_refind;
+    int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+    /* arg is the address of the variable refind in ExecAsyncEventWait */
+    ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+    *mcbarg->p_refind = NULL;
+    *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * ExecAsyncEventWait:
+ *
+ * Wait for async events to fire. Returns the Bitmapset of fired events.
+ */
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+    static int *refind = NULL;
+    static int refindsize = 0;
+    WaitEventSet *wes;
+    WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+    int noccurred = 0;
+    Bitmapset *fired_events = NULL;
+    int i;
+    int n;
+
+    n = bms_num_members(waitnodes);
+    wes = CreateWaitEventSet(TopTransactionContext,
+                             TopTransactionResourceOwner, n);
+    if (refindsize < n)
+    {
+        if (refindsize == 0)
+            refindsize = EVENT_BUFFER_SIZE; /* XXX */
+        while (refindsize < n)
+            refindsize *= 2;
+        if (refind)
+            refind = (int *) repalloc(refind, refindsize * sizeof(int));
+        else
+        {
+            static ExecAsync_mcbarg mcb_arg =
+                { &refind, &refindsize };
+            static MemoryContextCallback mcb =
+                { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+            MemoryContext oldctxt =
+                MemoryContextSwitchTo(TopTransactionContext);
+
+            /*
+             * refind points to a memory block in
+             * TopTransactionContext. Register a callback to reset it.
+             */
+            MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+            refind = (int *) palloc(refindsize * sizeof(int));
+            MemoryContextSwitchTo(oldctxt);
+        }
+    }
+
+    /* Prepare WaitEventSet for waiting on the waitnodes. */
+    n = 0;
+    for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+         i = bms_next_member(waitnodes, i))
+    {
+        refind[i] = i;
+        if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+            n++;
+    }
+
+    /* Return immediately if no node to wait. */
+    if (n == 0)
+    {
+        FreeWaitEventSet(wes);
+        return NULL;
+    }
+
+    noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+                                 EVENT_BUFFER_SIZE,
+                                 WAIT_EVENT_ASYNC_WAIT);
+    FreeWaitEventSet(wes);
+    if (noccurred == 0)
+        return NULL;
+
+    for (i = 0 ; i < noccurred ; i++)
+    {
+        WaitEvent *w = &occurred_event[i];
+
+        if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+        {
+            int n = *(int*)w->user_data;
+
+            fired_events = bms_add_member(fired_events, n);
+        }
+    }
+
+    return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..60c36ee048 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
 #include "executor/execdebug.h"
 #include "executor/execPartition.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -80,6 +81,7 @@ struct ParallelAppendState
 #define INVALID_SUBPLAN_INDEX        -1
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -103,22 +105,22 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     PlanState **appendplanstates;
     Bitmapset  *validsubplans;
     int            nplans;
+    int            nasyncplans;
     int            firstvalid;
     int            i,
                 j;
 
     /* check for unsupported flags */
-    Assert(!(eflags & EXEC_FLAG_MARK));
+    Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
     /*
      * create new AppendState for our append node
      */
     appendstate->ps.plan = (Plan *) node;
     appendstate->ps.state = estate;
-    appendstate->ps.ExecProcNode = ExecAppend;
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+    appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 
     /* If run-time partition pruning is enabled, then set that up now */
     if (node->part_prune_info != NULL)
@@ -152,11 +154,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 
         /*
          * When no run-time pruning is required and there's at least one
-         * subplan, we can fill as_valid_subplans immediately, preventing
+         * subplan, we can fill as_valid_syncsubplans immediately, preventing
          * later calls to ExecFindMatchingSubPlans.
          */
         if (!prunestate->do_exec_prune && nplans > 0)
-            appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
+            appendstate->as_valid_syncsubplans =
+                bms_add_range(NULL, node->nasyncplans, nplans - 1);
     }
     else
     {
@@ -167,8 +170,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
          * subplans as valid; they must also all be initialized.
          */
         Assert(nplans > 0);
-        appendstate->as_valid_subplans = validsubplans =
-            bms_add_range(NULL, 0, nplans - 1);
+        validsubplans =    bms_add_range(NULL, 0, nplans - 1);
+        appendstate->as_valid_syncsubplans =
+            bms_add_range(NULL, node->nasyncplans, nplans - 1);
         appendstate->as_prune_state = NULL;
     }
 
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
      */
     j = 0;
     firstvalid = nplans;
+    nasyncplans = 0;
+
     i = -1;
     while ((i = bms_next_member(validsubplans, i)) >= 0)
     {
         Plan       *initNode = (Plan *) list_nth(node->appendplans, i);
+        int            sub_eflags = eflags;
+
+        /* Let async-capable subplans run asynchronously */
+        if (i < node->nasyncplans)
+        {
+            sub_eflags |= EXEC_FLAG_ASYNC;
+            nasyncplans++;
+        }
 
         /*
          * Record the lowest appendplans index which is a valid partial plan.
@@ -203,13 +217,46 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
         if (i >= node->first_partial_plan && j < firstvalid)
             firstvalid = j;
 
-        appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+        appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
     }
 
     appendstate->as_first_partial_plan = firstvalid;
     appendstate->appendplans = appendplanstates;
     appendstate->as_nplans = nplans;
 
+    /* fill in async stuff */
+    appendstate->as_nasyncplans = nasyncplans;
+    appendstate->as_syncdone = (nasyncplans == nplans);
+    appendstate->as_exec_prune = false;
+
+    /* choose appropriate version of Exec function */
+    if (appendstate->as_nasyncplans == 0)
+        appendstate->ps.ExecProcNode = ExecAppend;
+    else
+        appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+    if (appendstate->as_nasyncplans)
+    {
+        appendstate->as_asyncresult = (TupleTableSlot **)
+            palloc0(appendstate->as_nasyncplans * sizeof(TupleTableSlot *));
+
+        /* initially, all async requests need a request */
+        appendstate->as_needrequest =
+            bms_add_range(NULL, 0, appendstate->as_nasyncplans - 1);
+
+        /*
+         * ExecAppendAsync needs as_valid_syncsubplans to handle async
+         * subnodes.
+         */
+        if (appendstate->as_prune_state != NULL &&
+            appendstate->as_prune_state->do_exec_prune)
+        {
+            Assert(appendstate->as_valid_syncsubplans == NULL);
+
+            appendstate->as_exec_prune = true;
+        }
+    }
+
     /*
      * Miscellaneous initialization
      */
@@ -233,7 +280,7 @@ ExecAppend(PlanState *pstate)
 {
     AppendState *node = castNode(AppendState, pstate);
 
-    if (node->as_whichplan < 0)
+    if (node->as_whichsyncplan < 0)
     {
         /* Nothing to do if there are no subplans */
         if (node->as_nplans == 0)
@@ -243,11 +290,13 @@ ExecAppend(PlanState *pstate)
          * If no subplan has been chosen, we must choose one before
          * proceeding.
          */
-        if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+        if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
             !node->choose_next_subplan(node))
             return ExecClearTuple(node->ps.ps_ResultTupleSlot);
     }
 
+    Assert(node->as_nasyncplans == 0);
+
     for (;;)
     {
         PlanState  *subnode;
@@ -258,8 +307,9 @@ ExecAppend(PlanState *pstate)
         /*
          * figure out which subplan we are currently processing
          */
-        Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-        subnode = node->appendplans[node->as_whichplan];
+        Assert(node->as_whichsyncplan >= 0 &&
+               node->as_whichsyncplan < node->as_nplans);
+        subnode = node->appendplans[node->as_whichsyncplan];
 
         /*
          * get a tuple from the subplan
@@ -282,6 +332,172 @@ ExecAppend(PlanState *pstate)
     }
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+    AppendState *node = castNode(AppendState, pstate);
+    Bitmapset *needrequest;
+    int    i;
+
+    Assert(node->as_nasyncplans > 0);
+
+restart:
+    if (node->as_nasyncresult > 0)
+    {
+        --node->as_nasyncresult;
+        return node->as_asyncresult[node->as_nasyncresult];
+    }
+
+    if (node->as_exec_prune)
+    {
+        Bitmapset *valid_subplans =
+            ExecFindMatchingSubPlans(node->as_prune_state);
+
+        /* Distribute valid subplans into sync and async */
+        node->as_needrequest =
+            bms_intersect(node->as_needrequest, valid_subplans);
+        node->as_valid_syncsubplans =
+            bms_difference(valid_subplans, node->as_needrequest);
+
+        node->as_exec_prune = false;
+    }
+
+    needrequest = node->as_needrequest;
+    node->as_needrequest = NULL;
+    while ((i = bms_first_member(needrequest)) >= 0)
+    {
+        TupleTableSlot *slot;
+        PlanState *subnode = node->appendplans[i];
+
+        slot = ExecProcNode(subnode);
+        if (subnode->asyncstate == AS_AVAILABLE)
+        {
+            if (!TupIsNull(slot))
+            {
+                node->as_asyncresult[node->as_nasyncresult++] = slot;
+                node->as_needrequest = bms_add_member(node->as_needrequest, i);
+            }
+        }
+        else
+            node->as_pending_async = bms_add_member(node->as_pending_async, i);
+    }
+    bms_free(needrequest);
+
+    for (;;)
+    {
+        TupleTableSlot *result;
+
+        /* return now if a result is available */
+        if (node->as_nasyncresult > 0)
+        {
+            --node->as_nasyncresult;
+            return node->as_asyncresult[node->as_nasyncresult];
+        }
+
+        while (!bms_is_empty(node->as_pending_async))
+        {
+            /* Don't wait for async nodes if any sync node exists. */
+            long timeout = node->as_syncdone ? -1 : 0;
+            Bitmapset *fired;
+            int i;
+
+            fired = ExecAsyncEventWait(node->appendplans,
+                                       node->as_pending_async,
+                                       timeout);
+
+            if (bms_is_empty(fired) && node->as_syncdone)
+            {
+                /*
+                 * We come here when all the subnodes had fired before
+                 * waiting. Retry fetching from the nodes.
+                 */
+                node->as_needrequest = node->as_pending_async;
+                node->as_pending_async = NULL;
+                goto restart;
+            }
+
+            while ((i = bms_first_member(fired)) >= 0)
+            {
+                TupleTableSlot *slot;
+                PlanState *subnode = node->appendplans[i];
+                slot = ExecProcNode(subnode);
+
+                Assert(subnode->asyncstate == AS_AVAILABLE);
+
+                if (!TupIsNull(slot))
+                {
+                    node->as_asyncresult[node->as_nasyncresult++] = slot;
+                    node->as_needrequest =
+                        bms_add_member(node->as_needrequest, i);
+                }
+
+                node->as_pending_async =
+                    bms_del_member(node->as_pending_async, i);
+            }
+            bms_free(fired);
+
+            /* return now if a result is available */
+            if (node->as_nasyncresult > 0)
+            {
+                --node->as_nasyncresult;
+                return node->as_asyncresult[node->as_nasyncresult];
+            }
+
+            if (!node->as_syncdone)
+                break;
+        }
+
+        /*
+         * If there is no asynchronous activity still pending and the
+         * synchronous activity is also complete, we're totally done scanning
+         * this node.  Otherwise, we're done with the asynchronous stuff but
+         * must continue scanning the synchronous children.
+         */
+
+        if (!node->as_syncdone &&
+            node->as_whichsyncplan == INVALID_SUBPLAN_INDEX)
+            node->as_syncdone = !node->choose_next_subplan(node);
+
+        if (node->as_syncdone)
+        {
+            Assert(bms_is_empty(node->as_pending_async));
+            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+        }
+
+        /*
+         * get a tuple from the subplan
+         */
+        result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+        if (!TupIsNull(result))
+        {
+            /*
+             * If the subplan gave us something then return it as-is. We do
+             * NOT make use of the result slot that was set up in
+             * ExecInitAppend; there's no need for it.
+             */
+            return result;
+        }
+
+        /*
+         * Go on to the "next" subplan. If no more subplans, return the empty
+         * slot set up for us by ExecInitAppend, unless there are async plans
+         * we have yet to finish.
+         */
+        if (!node->choose_next_subplan(node))
+        {
+            node->as_syncdone = true;
+            if (bms_is_empty(node->as_pending_async))
+            {
+                Assert(bms_is_empty(node->as_needrequest));
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+            }
+        }
+
+        /* Else loop back and try to get a tuple from the new subplan */
+    }
+}
+
 /* ----------------------------------------------------------------
  *        ExecEndAppend
  *
@@ -324,10 +540,18 @@ ExecReScanAppend(AppendState *node)
         bms_overlap(node->ps.chgParam,
                     node->as_prune_state->execparamids))
     {
-        bms_free(node->as_valid_subplans);
-        node->as_valid_subplans = NULL;
+        bms_free(node->as_valid_syncsubplans);
+        node->as_valid_syncsubplans = NULL;
     }
 
+    /* Reset async state. */
+    for (i = 0; i < node->as_nasyncplans; ++i)
+        ExecShutdownNode(node->appendplans[i]);
+
+    node->as_nasyncresult = 0;
+    node->as_needrequest = bms_add_range(NULL, 0, node->as_nasyncplans - 1);
+    node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
     for (i = 0; i < node->as_nplans; i++)
     {
         PlanState  *subnode = node->appendplans[i];
@@ -348,7 +572,7 @@ ExecReScanAppend(AppendState *node)
     }
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    node->as_whichplan = INVALID_SUBPLAN_INDEX;
+    node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 }
 
 /* ----------------------------------------------------------------
@@ -436,7 +660,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-    int            whichplan = node->as_whichplan;
+    int            whichplan = node->as_whichsyncplan;
     int            nextplan;
 
     /* We should never be called when there are no subplans */
@@ -451,10 +675,18 @@ choose_next_subplan_locally(AppendState *node)
      */
     if (whichplan == INVALID_SUBPLAN_INDEX)
     {
-        if (node->as_valid_subplans == NULL)
-            node->as_valid_subplans =
+        /* Shouldn't have an active async node */
+        Assert(bms_is_empty(node->as_needrequest));
+
+        if (node->as_valid_syncsubplans == NULL)
+            node->as_valid_syncsubplans =
                 ExecFindMatchingSubPlans(node->as_prune_state);
 
+        /* Exclude async plans */
+        if (node->as_nasyncplans > 0)
+            bms_del_range(node->as_valid_syncsubplans,
+                          0, node->as_nasyncplans - 1);
+
         whichplan = -1;
     }
 
@@ -462,14 +694,14 @@ choose_next_subplan_locally(AppendState *node)
     Assert(whichplan >= -1 && whichplan <= node->as_nplans);
 
     if (ScanDirectionIsForward(node->ps.state->es_direction))
-        nextplan = bms_next_member(node->as_valid_subplans, whichplan);
+        nextplan = bms_next_member(node->as_valid_syncsubplans, whichplan);
     else
-        nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
+        nextplan = bms_prev_member(node->as_valid_syncsubplans, whichplan);
 
     if (nextplan < 0)
         return false;
 
-    node->as_whichplan = nextplan;
+    node->as_whichsyncplan = nextplan;
 
     return true;
 }
@@ -490,29 +722,29 @@ choose_next_subplan_for_leader(AppendState *node)
     /* Backward scan is not supported by parallel-aware plans */
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
-    /* We should never be called when there are no subplans */
-    Assert(node->as_nplans > 0);
+    /* We should never be called when there are no sync subplans */
+    Assert(node->as_nplans > node->as_nasyncplans);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
     {
         /* Mark just-completed subplan as finished. */
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
     }
     else
     {
         /* Start with last subplan. */
-        node->as_whichplan = node->as_nplans - 1;
+        node->as_whichsyncplan = node->as_nplans - 1;
 
         /*
          * If we've yet to determine the valid subplans then do so now.  If
          * run-time pruning is disabled then the valid subplans will always be
          * set to all subplans.
          */
-        if (node->as_valid_subplans == NULL)
+        if (node->as_valid_syncsubplans == NULL)
         {
-            node->as_valid_subplans =
+            node->as_valid_syncsubplans =
                 ExecFindMatchingSubPlans(node->as_prune_state);
 
             /*
@@ -524,26 +756,26 @@ choose_next_subplan_for_leader(AppendState *node)
     }
 
     /* Loop until we find a subplan to execute. */
-    while (pstate->pa_finished[node->as_whichplan])
+    while (pstate->pa_finished[node->as_whichsyncplan])
     {
-        if (node->as_whichplan == 0)
+        if (node->as_whichsyncplan == 0)
         {
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-            node->as_whichplan = INVALID_SUBPLAN_INDEX;
+            node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
             LWLockRelease(&pstate->pa_lock);
             return false;
         }
 
         /*
-         * We needn't pay attention to as_valid_subplans here as all invalid
+         * We needn't pay attention to as_valid_syncsubplans here as all invalid
          * plans have been marked as finished.
          */
-        node->as_whichplan--;
+        node->as_whichsyncplan--;
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -571,23 +803,23 @@ choose_next_subplan_for_worker(AppendState *node)
     /* Backward scan is not supported by parallel-aware plans */
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
-    /* We should never be called when there are no subplans */
-    Assert(node->as_nplans > 0);
+    /* We should never be called when there are no sync subplans */
+    Assert(node->as_nplans > node->as_nasyncplans);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
     /* Mark just-completed subplan as finished. */
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     /*
      * If we've yet to determine the valid subplans then do so now.  If
      * run-time pruning is disabled then the valid subplans will always be set
      * to all subplans.
      */
-    else if (node->as_valid_subplans == NULL)
+    else if (node->as_valid_syncsubplans == NULL)
     {
-        node->as_valid_subplans =
+        node->as_valid_syncsubplans =
             ExecFindMatchingSubPlans(node->as_prune_state);
         mark_invalid_subplans_as_finished(node);
     }
@@ -600,30 +832,30 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Save the plan from which we are starting the search. */
-    node->as_whichplan = pstate->pa_next_plan;
+    node->as_whichsyncplan = pstate->pa_next_plan;
 
     /* Loop until we find a valid subplan to execute. */
     while (pstate->pa_finished[pstate->pa_next_plan])
     {
         int            nextplan;
 
-        nextplan = bms_next_member(node->as_valid_subplans,
+        nextplan = bms_next_member(node->as_valid_syncsubplans,
                                    pstate->pa_next_plan);
         if (nextplan >= 0)
         {
             /* Advance to the next valid plan. */
             pstate->pa_next_plan = nextplan;
         }
-        else if (node->as_whichplan > node->as_first_partial_plan)
+        else if (node->as_whichsyncplan > node->as_first_partial_plan)
         {
             /*
              * Try looping back to the first valid partial plan, if there is
              * one.  If there isn't, arrange to bail out below.
              */
-            nextplan = bms_next_member(node->as_valid_subplans,
+            nextplan = bms_next_member(node->as_valid_syncsubplans,
                                        node->as_first_partial_plan - 1);
             pstate->pa_next_plan =
-                nextplan < 0 ? node->as_whichplan : nextplan;
+                nextplan < 0 ? node->as_whichsyncplan : nextplan;
         }
         else
         {
@@ -631,10 +863,10 @@ choose_next_subplan_for_worker(AppendState *node)
              * At last plan, and either there are no partial plans or we've
              * tried them all.  Arrange to bail out.
              */
-            pstate->pa_next_plan = node->as_whichplan;
+            pstate->pa_next_plan = node->as_whichsyncplan;
         }
 
-        if (pstate->pa_next_plan == node->as_whichplan)
+        if (pstate->pa_next_plan == node->as_whichsyncplan)
         {
             /* We've tried everything! */
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -644,8 +876,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Pick the plan we found, and advance pa_next_plan one more time. */
-    node->as_whichplan = pstate->pa_next_plan;
-    pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
+    node->as_whichsyncplan = pstate->pa_next_plan;
+    pstate->pa_next_plan = bms_next_member(node->as_valid_syncsubplans,
                                            pstate->pa_next_plan);
 
     /*
@@ -654,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node)
      */
     if (pstate->pa_next_plan < 0)
     {
-        int            nextplan = bms_next_member(node->as_valid_subplans,
+        int            nextplan = bms_next_member(node->as_valid_syncsubplans,
                                                node->as_first_partial_plan - 1);
 
         if (nextplan >= 0)
@@ -671,8 +903,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -699,13 +931,13 @@ mark_invalid_subplans_as_finished(AppendState *node)
     Assert(node->as_prune_state);
 
     /* Nothing to do if all plans are valid */
-    if (bms_num_members(node->as_valid_subplans) == node->as_nplans)
+    if (bms_num_members(node->as_valid_syncsubplans) == node->as_nplans)
         return;
 
     /* Mark all non-valid plans as finished */
     for (i = 0; i < node->as_nplans; i++)
     {
-        if (!bms_is_member(i, node->as_valid_subplans))
+        if (!bms_is_member(i, node->as_valid_syncsubplans))
             node->as_pstate->pa_finished[i] = true;
     }
 }
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 513471ab9b..3bf4aaa63d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -141,6 +141,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
     scanstate->ss.ps.plan = (Plan *) node;
     scanstate->ss.ps.state = estate;
     scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+    scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+    if ((eflags & EXEC_FLAG_ASYNC) != 0)
+        scanstate->fs_async = true;
 
     /*
      * Miscellaneous initialization
@@ -384,3 +388,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
     if (fdwroutine->ShutdownForeignScan)
         fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *        ExecAsyncForeignScanConfigureWait
+ *
+ *        In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                              void *caller_data, bool reinit)
+{
+    FdwRoutine *fdwroutine = node->fdwroutine;
+
+    Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+    return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+                                                 caller_data, reinit);
+}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 2719ea45a3..05b625783b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -895,6 +895,78 @@ bms_add_range(Bitmapset *a, int lower, int upper)
     return a;
 }
 
+/*
+ * bms_del_range
+ *        Delete members in the range of 'lower' to 'upper' from the set.
+ *
+ * Note this could also be done by calling bms_del_member in a loop, however,
+ * using this function will be faster when the range is large as we work at
+ * the bitmapword level rather than at bit level.
+ */
+Bitmapset *
+bms_del_range(Bitmapset *a, int lower, int upper)
+{
+    int            lwordnum,
+                lbitnum,
+                uwordnum,
+                ushiftbits,
+                wordnum;
+
+    if (lower < 0 || upper < 0)
+        elog(ERROR, "negative bitmapset member not allowed");
+    if (lower > upper)
+        elog(ERROR, "lower range must not be above upper range");
+    uwordnum = WORDNUM(upper);
+
+    if (a == NULL)
+    {
+        a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1));
+        a->nwords = uwordnum + 1;
+    }
+
+    /* ensure we have enough words to store the upper bit */
+    else if (uwordnum >= a->nwords)
+    {
+        int            oldnwords = a->nwords;
+        int            i;
+
+        a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1));
+        a->nwords = uwordnum + 1;
+        /* zero out the enlarged portion */
+        for (i = oldnwords; i < a->nwords; i++)
+            a->words[i] = 0;
+    }
+
+    wordnum = lwordnum = WORDNUM(lower);
+
+    lbitnum = BITNUM(lower);
+    ushiftbits = BITNUM(upper) + 1;
+
+    /*
+     * Special case when lwordnum is the same as uwordnum we must perform the
+     * upper and lower masking on the word.
+     */
+    if (lwordnum == uwordnum)
+    {
+        a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1)
+                                | (~(bitmapword) 0) << ushiftbits);
+    }
+    else
+    {
+        /* turn off lbitnum and all bits left of it */
+        a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1);
+
+        /* turn off all bits for any intermediate words */
+        while (wordnum < uwordnum)
+            a->words[wordnum++] = (bitmapword) 0;
+
+        /* turn off upper's bit and all bits right of it. */
+        a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits;
+    }
+
+    return a;
+}
+
 /*
  * bms_int_members - like bms_intersect, but left input is recycled
  */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d8cf87e6d0..89a49e2fdc 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -121,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
     COPY_SCALAR_FIELD(plan_width);
     COPY_SCALAR_FIELD(parallel_aware);
     COPY_SCALAR_FIELD(parallel_safe);
+    COPY_SCALAR_FIELD(async_capable);
     COPY_SCALAR_FIELD(plan_node_id);
     COPY_NODE_FIELD(targetlist);
     COPY_NODE_FIELD(qual);
@@ -246,6 +247,8 @@ _copyAppend(const Append *from)
     COPY_NODE_FIELD(appendplans);
     COPY_SCALAR_FIELD(first_partial_plan);
     COPY_NODE_FIELD(part_prune_info);
+    COPY_SCALAR_FIELD(nasyncplans);
+    COPY_SCALAR_FIELD(referent);
 
     return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515d..d4bb44b268 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -334,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
     WRITE_INT_FIELD(plan_width);
     WRITE_BOOL_FIELD(parallel_aware);
     WRITE_BOOL_FIELD(parallel_safe);
+    WRITE_BOOL_FIELD(async_capable);
     WRITE_INT_FIELD(plan_node_id);
     WRITE_NODE_FIELD(targetlist);
     WRITE_NODE_FIELD(qual);
@@ -436,6 +437,8 @@ _outAppend(StringInfo str, const Append *node)
     WRITE_NODE_FIELD(appendplans);
     WRITE_INT_FIELD(first_partial_plan);
     WRITE_NODE_FIELD(part_prune_info);
+    WRITE_INT_FIELD(nasyncplans);
+    WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab719..63af7c02d8 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1572,6 +1572,7 @@ ReadCommonPlan(Plan *local_node)
     READ_INT_FIELD(plan_width);
     READ_BOOL_FIELD(parallel_aware);
     READ_BOOL_FIELD(parallel_safe);
+    READ_BOOL_FIELD(async_capable);
     READ_INT_FIELD(plan_node_id);
     READ_NODE_FIELD(targetlist);
     READ_NODE_FIELD(qual);
@@ -1672,6 +1673,8 @@ _readAppend(void)
     READ_NODE_FIELD(appendplans);
     READ_INT_FIELD(first_partial_plan);
     READ_NODE_FIELD(part_prune_info);
+    READ_INT_FIELD(nasyncplans);
+    READ_INT_FIELD(referent);
 
     READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 744eed187d..ba18dd88a8 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -300,6 +300,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
                                      List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
                                              GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -1082,6 +1083,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     bool        tlist_was_changed = false;
     List       *pathkeys = best_path->path.pathkeys;
     List       *subplans = NIL;
+    List       *asyncplans = NIL;
+    List       *syncplans = NIL;
+    List       *asyncpaths = NIL;
+    List       *syncpaths = NIL;
+    List       *newsubpaths = NIL;
     ListCell   *subpaths;
     RelOptInfo *rel = best_path->path.parent;
     PartitionPruneInfo *partpruneinfo = NULL;
@@ -1090,6 +1096,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     Oid           *nodeSortOperators = NULL;
     Oid           *nodeCollations = NULL;
     bool       *nodeNullsFirst = NULL;
+    int            nasyncplans = 0;
+    bool        first = true;
+    bool        referent_is_sync = true;
 
     /*
      * The subpaths list could be empty, if every child was proven empty by
@@ -1219,9 +1228,36 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
             }
         }
 
-        subplans = lappend(subplans, subplan);
+        /*
+         * Classify as async-capable or not. If we have decided to run the
+         * children in parallel, we cannot any one of them run asynchronously.
+         */
+        if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+        {
+            subplan->async_capable = true;
+            asyncplans = lappend(asyncplans, subplan);
+            asyncpaths = lappend(asyncpaths, subpath);
+            ++nasyncplans;
+            if (first)
+                referent_is_sync = false;
+        }
+        else
+        {
+            syncplans = lappend(syncplans, subplan);
+            syncpaths = lappend(syncpaths, subpath);
+        }
+
+        first = false;
     }
 
+    /*
+     * subplan contains asyncplans in the first half, if any, and sync plans in
+     * another half, if any.  We need that the same for subpaths to make
+     * partition pruning information in sync with subplans.
+     */
+    subplans = list_concat(asyncplans, syncplans);
+    newsubpaths = list_concat(asyncpaths, syncpaths);
+
     /*
      * If any quals exist, they may be useful to perform further partition
      * pruning during execution.  Gather information needed by the executor to
@@ -1249,7 +1285,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
         if (prunequal != NIL)
             partpruneinfo =
                 make_partition_pruneinfo(root, rel,
-                                         best_path->subpaths,
+                                         newsubpaths,
                                          best_path->partitioned_rels,
                                          prunequal);
     }
@@ -1257,6 +1293,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     plan->appendplans = subplans;
     plan->first_partial_plan = best_path->first_partial_path;
     plan->part_prune_info = partpruneinfo;
+    plan->nasyncplans = nasyncplans;
+    plan->referent = referent_is_sync ? nasyncplans : 0;
 
     copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -7016,3 +7054,27 @@ is_projection_capable_plan(Plan *plan)
     }
     return true;
 }
+
+/*
+ * is_projection_capable_path
+ *        Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+    switch (nodeTag(path))
+    {
+        case T_ForeignPath:
+            {
+                FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+                Assert(fdwroutine != NULL);
+                if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+                    fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+                    return true;
+            }
+        default:
+            break;
+    }
+    return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d7f99d9944..79a2562454 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3882,6 +3882,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_XACT_GROUP_UPDATE:
             event_name = "XactGroupUpdate";
             break;
+        case WAIT_EVENT_ASYNC_WAIT:
+            event_name = "AsyncExecWait";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index ffcb54968f..a4de6d90e2 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -300,7 +300,7 @@ SysLoggerMain(int argc, char *argv[])
      * syslog pipe, which implies that all other backends have exited
      * (including the postmaster).
      */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 2);
+    wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 2);
     AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
 #ifndef WIN32
     AddWaitEventToSet(wes, WL_SOCKET_READABLE, syslogPipe[0], NULL, NULL);
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 076c3c019f..f7b5587d7f 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4584,10 +4584,14 @@ set_deparse_plan(deparse_namespace *dpns, Plan *plan)
      * tlists according to one of the children, and the first one is the most
      * natural choice.  Likewise special-case ModifyTable to pretend that the
      * first child plan is the OUTER referent; this is to support RETURNING
-     * lists containing references to non-target relations.
+     * lists containing references to non-target relations. For Append, use the
+     * explicitly specified referent.
      */
     if (IsA(plan, Append))
-        dpns->outer_plan = linitial(((Append *) plan)->appendplans);
+    {
+        Append *app = (Append *) plan;
+        dpns->outer_plan = list_nth(app->appendplans, app->referent);
+    }
     else if (IsA(plan, MergeAppend))
         dpns->outer_plan = linitial(((MergeAppend *) plan)->mergeplans);
     else if (IsA(plan, ModifyTable))
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 237ca9fa30..27742a1641 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -1416,7 +1416,7 @@ void
 ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
 {
     /*
-     * XXXX: There's no property to show as an identier of a wait event set,
+     * XXXX: There's no property to show as an identifier of a wait event set,
      * use its pointer instead.
      */
     if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
@@ -1431,7 +1431,7 @@ static void
 PrintWESLeakWarning(WaitEventSet *events)
 {
     /*
-     * XXXX: There's no property to show as an identier of a wait event set,
+     * XXXX: There's no property to show as an identifier of a wait event set,
      * use its pointer instead.
      */
     elog(WARNING, "wait event set leak: %p still referenced",
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..3b6bf4a516
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,22 @@
+/*--------------------------------------------------------------------
+ * execAsync.c
+ *        Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *        src/backend/executor/execAsync.c
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                                   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+                                     long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c7deeac662..aca9e2bddd 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -59,6 +59,7 @@
 #define EXEC_FLAG_MARK            0x0008    /* need mark/restore */
 #define EXEC_FLAG_SKIP_TRIGGERS 0x0010    /* skip AfterTrigger calls */
 #define EXEC_FLAG_WITH_NO_DATA    0x0020    /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC            0x0040    /* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..71a233b41f 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
                                             ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+                                          WaitEventSet *wes,
+                                          void *caller_data, bool reinit);
 
 #endif                            /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..853ba2b5ad 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -169,6 +169,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
                                                             List *fdw_private,
                                                             RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+                                                    WaitEventSet *wes,
+                                                    void *caller_data,
+                                                    bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -190,6 +195,7 @@ typedef struct FdwRoutine
     GetForeignPlan_function GetForeignPlan;
     BeginForeignScan_function BeginForeignScan;
     IterateForeignScan_function IterateForeignScan;
+    IterateForeignScan_function IterateForeignScanAsync;
     ReScanForeignScan_function ReScanForeignScan;
     EndForeignScan_function EndForeignScan;
 
@@ -242,6 +248,11 @@ typedef struct FdwRoutine
     InitializeDSMForeignScan_function InitializeDSMForeignScan;
     ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
     InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+    /* Support functions for asynchronous execution */
+    IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+    ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
     ShutdownForeignScan_function ShutdownForeignScan;
 
     /* Support functions for path reparameterization. */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index d113c271ee..177e6218cb 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -107,6 +107,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
+extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
 
 /* support for iterating through the integer elements of a set: */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 98e0072b8a..cd50494c74 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -938,6 +938,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+    AS_AVAILABLE,
+    AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
     NodeTag        type;
@@ -1026,6 +1032,11 @@ typedef struct PlanState
     bool        outeropsset;
     bool        inneropsset;
     bool        resultopsset;
+
+    /* Async subnode execution stuff */
+    AsyncState    asyncstate;
+
+    int32        padding;            /* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1221,14 +1232,21 @@ struct AppendState
     PlanState    ps;                /* its first field is NodeTag */
     PlanState **appendplans;    /* array of PlanStates for my inputs */
     int            as_nplans;
-    int            as_whichplan;
+    int            as_whichsyncplan; /* which sync plan is being executed  */
     int            as_first_partial_plan;    /* Index of 'appendplans' containing
                                          * the first partial plan */
+    int            as_nasyncplans;    /* # of async-capable children */
     ParallelAppendState *as_pstate; /* parallel coordination info */
     Size        pstate_len;        /* size of parallel coordination info */
     struct PartitionPruneState *as_prune_state;
-    Bitmapset  *as_valid_subplans;
+    Bitmapset  *as_valid_syncsubplans;
     bool        (*choose_next_subplan) (AppendState *);
+    bool        as_syncdone;    /* all synchronous plans done? */
+    Bitmapset  *as_needrequest;    /* async plans needing a new request */
+    Bitmapset  *as_pending_async;    /* pending async plans */
+    TupleTableSlot **as_asyncresult;    /* results of each async plan */
+    int            as_nasyncresult;    /* # of valid entries in as_asyncresult */
+    bool        as_exec_prune;    /* runtime pruning needed for async exec? */
 };
 
 /* ----------------
@@ -1796,6 +1814,7 @@ typedef struct ForeignScanState
     Size        pscan_len;        /* size of parallel coordination information */
     /* use struct pointer to avoid including fdwapi.h here */
     struct FdwRoutine *fdwroutine;
+    bool        fs_async;
     void       *fdw_state;        /* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 83e01074ed..abad89b327 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -135,6 +135,11 @@ typedef struct Plan
     bool        parallel_aware; /* engage parallel-aware logic? */
     bool        parallel_safe;    /* OK to use as part of parallel plan? */
 
+    /*
+     * information needed for asynchronous execution
+     */
+    bool        async_capable;  /* engage asynchronous execution logic? */
+
     /*
      * Common structural data for all Plan types.
      */
@@ -262,6 +267,10 @@ typedef struct Append
 
     /* Info for run-time subplan pruning; NULL if we're not doing that */
     struct PartitionPruneInfo *part_prune_info;
+
+    /* Async child node execution stuff */
+    int            nasyncplans;    /* # async subplans, always at start of list */
+    int            referent;         /* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..2259910637 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -887,7 +887,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
     WAIT_EVENT_SYNC_REP,
-    WAIT_EVENT_XACT_GROUP_UPDATE
+    WAIT_EVENT_XACT_GROUP_UPDATE,
+    WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.18.2

From 4c207a7901f0a9d05aacb5ce46a7f1daa83ce474 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH v3 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c             |  28 +
 .../postgres_fdw/expected/postgres_fdw.out    | 222 ++++---
 contrib/postgres_fdw/postgres_fdw.c           | 603 +++++++++++++++---
 contrib/postgres_fdw/postgres_fdw.h           |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql     |  20 +-
 5 files changed, 694 insertions(+), 181 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 52d1fe3563..d9edc5e4de 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
     bool        invalidated;    /* true if reconnect is pending */
     uint32        server_hashvalue;    /* hash value of foreign server OID */
     uint32        mapping_hashvalue;    /* hash value of user mapping OID */
+    void        *storage;        /* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
         elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
              entry->conn, server->servername, user->umid, user->userid);
+        entry->storage = NULL;
     }
 
     /*
@@ -215,6 +217,32 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
     return entry->conn;
 }
 
+/*
+ * Returns the connection specific storage for this user. Allocate with
+ * initsize if not exists.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+    bool        found;
+    ConnCacheEntry *entry;
+    ConnCacheKey key;
+
+    /* Find storage using the same key with GetConnection */
+    key = user->umid;
+    entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+    Assert(found);
+
+    /* Create one if not yet. */
+    if (entry->storage == NULL)
+    {
+        entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+        memset(entry->storage, 0, initsize);
+    }
+
+    return entry->storage;
+}
+
 /*
  * Connect to remote server using specified server and user mapping properties.
  */
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 82fc1290ef..29aa09db8e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6973,7 +6973,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -7001,7 +7001,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -7029,7 +7029,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -7057,7 +7057,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -7127,35 +7127,41 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar bar_1
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+                     Sort Key: bar_1.f1
+                     ->  Seq Scan on public.bar bar_1
+                           Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
                ->  Foreign Scan on public.bar2 bar_2
                      Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -7165,35 +7171,41 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar bar_1
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+                     Sort Key: bar_1.f1
+                     ->  Seq Scan on public.bar bar_1
+                           Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
                ->  Foreign Scan on public.bar2 bar_2
                      Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -7223,11 +7235,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
    ->  Hash Join
          Output: bar_1.f1, (bar_1.f2 + 100), bar_1.f3, bar_1.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -7241,12 +7254,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(41 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
 select tableoid::regclass, * from bar order by 1,2;
@@ -7276,16 +7290,17 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
+               Async subplans: 2 
+               ->  Async Foreign Scan on public.foo2 foo_1
+                     Output: ROW(foo_1.f1), foo_1.f1
+                     Remote SQL: SELECT f1 FROM public.loct1
+               ->  Async Foreign Scan on public.foo2 foo_3
+                     Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+                     Remote SQL: SELECT f1 FROM public.loct1
                ->  Seq Scan on public.foo
                      Output: ROW(foo.f1), foo.f1
-               ->  Foreign Scan on public.foo2 foo_1
-                     Output: ROW(foo_1.f1), foo_1.f1
-                     Remote SQL: SELECT f1 FROM public.loct1
                ->  Seq Scan on public.foo foo_2
                      Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
-               ->  Foreign Scan on public.foo2 foo_3
-                     Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
-                     Remote SQL: SELECT f1 FROM public.loct1
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -7303,17 +7318,18 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on public.foo2 foo_1
+                           Output: ROW(foo_1.f1), foo_1.f1
+                           Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Async Foreign Scan on public.foo2 foo_3
+                           Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+                           Remote SQL: SELECT f1 FROM public.loct1
                      ->  Seq Scan on public.foo
                            Output: ROW(foo.f1), foo.f1
-                     ->  Foreign Scan on public.foo2 foo_1
-                           Output: ROW(foo_1.f1), foo_1.f1
-                           Remote SQL: SELECT f1 FROM public.loct1
                      ->  Seq Scan on public.foo foo_2
                            Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
-                     ->  Foreign Scan on public.foo2 foo_3
-                           Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
-                           Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+(47 rows)
 
 update bar set f2 = f2 + 100
 from
@@ -7463,27 +7479,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2 bar_1
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2 bar_1
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2 bar_1
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2 bar_1
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
@@ -8558,11 +8580,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
  Sort
    Sort Key: t1.a, t3.c
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE
t1.a% 25 =0 ORDER BY 1,2,3;
 
   a  |  b  |  c   
@@ -8597,20 +8620,22 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
 -- with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2
t2WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
 
-                       QUERY PLAN                       
---------------------------------------------------------
+                          QUERY PLAN                          
+--------------------------------------------------------------
  Sort
    Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
    ->  Hash Full Join
          Hash Cond: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
-               ->  Foreign Scan on ftprt1_p2 t1_2
+               Async subplans: 2 
+               ->  Async Foreign Scan on ftprt1_p1 t1_1
+               ->  Async Foreign Scan on ftprt1_p2 t1_2
          ->  Hash
                ->  Append
-                     ->  Foreign Scan on ftprt2_p1 t2_1
-                     ->  Foreign Scan on ftprt2_p2 t2_2
-(11 rows)
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on ftprt2_p1 t2_1
+                     ->  Async Foreign Scan on ftprt2_p2 t2_2
+(13 rows)
 
 SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2
t2WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
 
        wr       |       wr       
@@ -8639,11 +8664,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
  Sort
    Sort Key: t1.a, t1.b
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE
t1.a%25= 0 ORDER BY 1,2;
 
   a  |  b  
@@ -8696,21 +8722,23 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
 -- test FOR UPDATE; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF
t1;
-                          QUERY PLAN                          
---------------------------------------------------------------
+                             QUERY PLAN                             
+--------------------------------------------------------------------
  LockRows
    ->  Sort
          Sort Key: t1.a
          ->  Hash Join
                Hash Cond: (t2.b = t1.a)
                ->  Append
-                     ->  Foreign Scan on ftprt2_p1 t2_1
-                     ->  Foreign Scan on ftprt2_p2 t2_2
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on ftprt2_p1 t2_1
+                     ->  Async Foreign Scan on ftprt2_p2 t2_2
                ->  Hash
                      ->  Append
-                           ->  Foreign Scan on ftprt1_p1 t1_1
-                           ->  Foreign Scan on ftprt1_p2 t1_2
-(12 rows)
+                           Async subplans: 2 
+                           ->  Async Foreign Scan on ftprt1_p1 t1_1
+                           ->  Async Foreign Scan on ftprt1_p2 t1_2
+(14 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF
t1;
   a  |  b  
@@ -8745,18 +8773,19 @@ ANALYZE fpagg_tab_p3;
 SET enable_partitionwise_aggregate TO false;
 EXPLAIN (COSTS OFF)
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
-                        QUERY PLAN                         
------------------------------------------------------------
+                           QUERY PLAN                            
+-----------------------------------------------------------------
  Sort
    Sort Key: pagg_tab.a
    ->  HashAggregate
          Group Key: pagg_tab.a
          Filter: (avg(pagg_tab.b) < '22'::numeric)
          ->  Append
-               ->  Foreign Scan on fpagg_tab_p1 pagg_tab_1
-               ->  Foreign Scan on fpagg_tab_p2 pagg_tab_2
-               ->  Foreign Scan on fpagg_tab_p3 pagg_tab_3
-(9 rows)
+               Async subplans: 3 
+               ->  Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+               ->  Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+               ->  Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
 
 -- Plan with partitionwise aggregates is enabled
 SET enable_partitionwise_aggregate TO true;
@@ -8767,13 +8796,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
  Sort
    Sort Key: pagg_tab.a
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 3 
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
-(9 rows)
+(10 rows)
 
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | sum  | min | count 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 9fc53cad68..b04b6a0e54 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -35,6 +37,7 @@
 #include "optimizer/restrictinfo.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "postgres_fdw.h"
 #include "utils/builtins.h"
 #include "utils/float.h"
@@ -56,6 +59,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -122,11 +128,29 @@ enum FdwDirectModifyPrivateIndex
     FdwDirectModifyPrivateSetProcessed
 };
 
+/*
+ * Connection common state - shared among all PgFdwState instances using the
+ * same connection.
+ */
+typedef struct PgFdwConnCommonState
+{
+    ForeignScanState   *leader;        /* leader node of this connection */
+    bool                busy;        /* true if this connection is busy */
+} PgFdwConnCommonState;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+    PGconn       *conn;                    /* connection for the scan */
+    PgFdwConnCommonState *commonstate;    /* connection common state */
+} PgFdwState;
+
 /*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table. NULL
                                  * for a foreign join scan. */
     TupleDesc    tupdesc;        /* tuple descriptor of scan */
@@ -137,7 +161,6 @@ typedef struct PgFdwScanState
     List       *retrieved_attrs;    /* list of retrieved attribute numbers */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     unsigned int cursor_number; /* quasi-unique ID for my cursor */
     bool        cursor_exists;    /* have we created the cursor? */
     int            numParams;        /* number of parameters passed to query */
@@ -153,6 +176,12 @@ typedef struct PgFdwScanState
     /* batch-level state, for optimizing rewinds and avoiding useless fetch */
     int            fetch_ct_2;        /* Min(# of fetches done, 2) */
     bool        eof_reached;    /* true if last fetch reached EOF */
+    bool        async;            /* true if run asynchronously */
+    bool        queued;            /* true if this node is in waiter queue */
+    ForeignScanState *waiter;    /* Next node to run a query among nodes
+                                 * sharing the same connection */
+    ForeignScanState *last_waiter;    /* last element in waiter queue.
+                                     * valid only on the leader node */
 
     /* working memory contexts */
     MemoryContext batch_cxt;    /* context holding current batch of tuples */
@@ -166,11 +195,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     char       *p_name;            /* name of prepared statement, if created */
 
     /* extracted fdw_private data */
@@ -197,6 +226,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
@@ -326,6 +356,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
                                             RangeTblEntry *target_rte,
                                             Relation target_relation);
@@ -391,6 +422,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
                                          RelOptInfo *input_rel,
                                          RelOptInfo *output_rel,
                                          void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+                                              WaitEventSet *wes,
+                                              void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -419,7 +454,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
                                       EquivalenceClass *ec, EquivalenceMember *em,
                                       void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static PgFdwModifyState *create_foreign_modify(EState *estate,
                                                RangeTblEntry *rte,
@@ -522,6 +559,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     routine->IterateForeignScan = postgresIterateForeignScan;
     routine->ReScanForeignScan = postgresReScanForeignScan;
     routine->EndForeignScan = postgresEndForeignScan;
+    routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
     /* Functions for updating foreign tables */
     routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -558,6 +596,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     /* Support functions for upper relation push-down */
     routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+    /* Support functions for async execution */
+    routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+    routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
     PG_RETURN_POINTER(routine);
 }
 
@@ -1434,12 +1476,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    fsstate->conn = GetConnection(user, false);
+    fsstate->s.conn = GetConnection(user, false);
+    fsstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
+    fsstate->s.commonstate->leader = NULL;
+    fsstate->s.commonstate->busy = false;
+    fsstate->waiter = NULL;
+    fsstate->last_waiter = node;
 
     /* Assign a unique ID for my cursor */
-    fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+    fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
     fsstate->cursor_exists = false;
 
+    /* Initialize async execution status */
+    fsstate->async = false;
+    fsstate->queued = false;
+
     /* Get private info created by planner functions. */
     fsstate->query = strVal(list_nth(fsplan->fdw_private,
                                      FdwScanPrivateSelectSql));
@@ -1487,40 +1539,244 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
                              &fsstate->param_values);
 }
 
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Enqueue node if it isn't in the queue. Immediately send request it if the
+ * underlying connection is not busy.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+    PgFdwScanState   *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.commonstate->leader;
+
+    /*
+     * Do nothing if the node is already in the queue or already eof'ed.
+     * Note: leader node is not marked as queued.
+     */
+    if (leader == node || fsstate->queued || fsstate->eof_reached)
+        return;
+
+    if (leader == NULL)
+    {
+        /* no leader means not busy, send request immediately */
+        request_more_data(node);
+    }
+    else
+    {
+        /* the connection is busy, queue the node */
+        PgFdwScanState   *leader_state = GetPgFdwScanState(leader);
+        PgFdwScanState   *last_waiter_state
+            = GetPgFdwScanState(leader_state->last_waiter);
+
+        last_waiter_state->waiter = node;
+        leader_state->last_waiter = node;
+        fsstate->queued = true;
+    }
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Make the first waiter be the next leader
+ * Returns the new leader or NULL if there's no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+    PgFdwScanState *leader_state = GetPgFdwScanState(node);
+    ForeignScanState *next_leader = leader_state->waiter;
+
+    Assert(leader_state->s.commonstate->leader = node);
+
+    if (next_leader)
+    {
+        /* the first waiter becomes the next leader */
+        PgFdwScanState *next_leader_state = GetPgFdwScanState(next_leader);
+        next_leader_state->last_waiter = leader_state->last_waiter;
+        next_leader_state->queued = false;
+    }
+
+    leader_state->waiter = NULL;
+    leader_state->s.commonstate->leader = next_leader;
+
+    return next_leader;
+}
+
+/*
+ * Remove the node from waiter queue.
+ *
+ * Remaining results are cleared if the node is a busy leader.
+ * This intended to be used during node shutdown.
+ */
+static inline void
+remove_async_node(ForeignScanState *node)
+{
+    PgFdwScanState        *fsstate = GetPgFdwScanState(node);
+    ForeignScanState    *leader = fsstate->s.commonstate->leader;
+    PgFdwScanState        *leader_state;
+    ForeignScanState    *prev;
+    PgFdwScanState        *prev_state;
+    ForeignScanState    *cur;
+
+    /* no need to remove me */
+    if (!leader || !fsstate->queued)
+        return;
+
+    leader_state = GetPgFdwScanState(leader);
+
+    if (leader == node)
+    {
+        /* It's the leader */
+        ForeignScanState    *next_leader;
+
+        if (leader_state->s.commonstate->busy)
+        {
+            /*
+             * this node is waiting for result, absorb the result first so
+             * that the following commands can be sent on the connection.
+             */
+            PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+            PGconn *conn = leader_state->s.conn;
+
+            while(PQisBusy(conn))
+                PQclear(PQgetResult(conn));
+
+            leader_state->s.commonstate->busy = false;
+        }
+
+        move_to_next_waiter(node);
+
+        return;
+    }
+
+    /*
+     * Just remove the node from the queue
+     *
+     * Nodes don't have a link to the previous node but anyway this function is
+     * called on the shutdown path, so we don't bother seeking for faster way
+     * to do this.
+     */
+    prev = leader;
+    prev_state = leader_state;
+    cur =  GetPgFdwScanState(prev)->waiter;
+    while (cur)
+    {
+        PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+        if (cur == node)
+        {
+            prev_state->waiter = curstate->waiter;
+
+            /* relink to the previous node if the last node was removed */
+            if (leader_state->last_waiter == cur)
+                leader_state->last_waiter = prev;
+
+            fsstate->queued = false;
+
+            return;
+        }
+        prev = cur;
+        prev_state = curstate;
+        cur = curstate->waiter;
+    }
+}
+
 /*
  * postgresIterateForeignScan
- *        Retrieve next row from the result set, or clear tuple slot to indicate
- *        EOF.
+ *        Retrieve next row from the result set.
+ *
+ *        For synchronous nodes, returns clear tuple slot means EOF.
+ *
+ *        For asynchronous nodes, if clear tuple slot is returned, the caller
+ *        needs to check async state to tell if all tuples received
+ *        (AS_AVAILABLE) or waiting for the next data to come (AS_WAITING).
  */
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
-    /*
-     * If this is the first call after Begin or ReScan, we need to create the
-     * cursor on the remote side.
-     */
-    if (!fsstate->cursor_exists)
-        create_cursor(node);
-
-    /*
-     * Get some more tuples, if we've run out.
-     */
+    if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+    {
+        /* we've run out, get some more tuples */
+        if (!node->fs_async)
+        {
+            /*
+             * finish the running query before sending the next command for
+             * this node
+             */
+            if (!fsstate->s.commonstate->busy)
+                vacate_connection((PgFdwState *)fsstate, false);
+
+            request_more_data(node);
+
+            /* Fetch the result immediately. */
+            fetch_received_data(node);
+        }
+        else if (!fsstate->s.commonstate->busy)
+        {
+            /* If the connection is not busy, just send the request. */
+            request_more_data(node);
+        }
+        else
+        {
+            /* The connection is busy, queue the request */
+            bool available = true;
+            ForeignScanState *leader = fsstate->s.commonstate->leader;
+            PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+
+            /* queue the requested node */
+            add_async_waiter(node);
+
+            /*
+             * The request for the next node cannot be sent before the leader
+             * responds. Finish the current leader if possible.
+             */
+            if (PQisBusy(leader_state->s.conn))
+            {
+                int rc = WaitLatchOrSocket(NULL,
+                                           WL_SOCKET_READABLE | WL_TIMEOUT |
+                                           WL_EXIT_ON_PM_DEATH,
+                                           PQsocket(leader_state->s.conn), 0,
+                                           WAIT_EVENT_ASYNC_WAIT);
+                if (!(rc & WL_SOCKET_READABLE))
+                    available = false;
+            }
+
+            /* fetch the leader's data and enqueue it for the next request */
+            if (available)
+            {
+                fetch_received_data(leader);
+                add_async_waiter(leader);
+            }
+        }
+    }
+
     if (fsstate->next_tuple >= fsstate->num_tuples)
     {
-        /* No point in another fetch if we already detected EOF, though. */
-        if (!fsstate->eof_reached)
-            fetch_more_data(node);
-        /* If we didn't get any tuples, must be end of data. */
-        if (fsstate->next_tuple >= fsstate->num_tuples)
-            return ExecClearTuple(slot);
+        /*
+         * We haven't received a result for the given node this time, return
+         * with no tuple to give way to another node.
+         */
+        if (fsstate->eof_reached)
+            node->ss.ps.asyncstate = AS_AVAILABLE;
+        else
+            node->ss.ps.asyncstate = AS_WAITING;
+
+        return ExecClearTuple(slot);
     }
 
     /*
      * Return the next tuple.
      */
+    node->ss.ps.asyncstate = AS_AVAILABLE;
     ExecStoreHeapTuple(fsstate->tuples[fsstate->next_tuple++],
                        slot,
                        false);
@@ -1535,7 +1791,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     char        sql[64];
     PGresult   *res;
 
@@ -1543,6 +1799,8 @@ postgresReScanForeignScan(ForeignScanState *node)
     if (!fsstate->cursor_exists)
         return;
 
+    vacate_connection((PgFdwState *)fsstate, true);
+
     /*
      * If any internal parameters affecting this node have changed, we'd
      * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1571,9 +1829,9 @@ postgresReScanForeignScan(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_exec_query(fsstate->conn, sql);
+    res = pgfdw_exec_query(fsstate->s.conn, sql);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+        pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
     PQclear(res);
 
     /* Now force a fresh FETCH. */
@@ -1591,7 +1849,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
     /* if fsstate is NULL, we are in EXPLAIN; nothing to do */
     if (fsstate == NULL)
@@ -1599,15 +1857,31 @@ postgresEndForeignScan(ForeignScanState *node)
 
     /* Close the cursor if open, to prevent accumulation of cursors */
     if (fsstate->cursor_exists)
-        close_cursor(fsstate->conn, fsstate->cursor_number);
+        close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
     /* Release remote connection */
-    ReleaseConnection(fsstate->conn);
-    fsstate->conn = NULL;
+    ReleaseConnection(fsstate->s.conn);
+    fsstate->s.conn = NULL;
 
     /* MemoryContexts will be deleted automatically. */
 }
 
+/*
+ * postgresShutdownForeignScan
+ *        Remove asynchrony stuff and cleanup garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+    ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+    if (plan->operation != CMD_SELECT)
+        return;
+
+    /* remove the node from waiting queue */
+    remove_async_node(node);
+}
+
 /*
  * postgresAddForeignUpdateTargets
  *        Add resjunk column(s) needed for update/delete on a foreign table
@@ -2372,7 +2646,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    dmstate->conn = GetConnection(user, false);
+    dmstate->s.conn = GetConnection(user, false);
+    dmstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
 
     /* Update the foreign-join-related fields. */
     if (fsplan->scan.scanrelid == 0)
@@ -2457,7 +2733,11 @@ postgresIterateDirectModify(ForeignScanState *node)
      * If this is the first call after Begin, execute the statement.
      */
     if (dmstate->num_tuples == -1)
+    {
+        /* finish running query to send my command */
+        vacate_connection((PgFdwState *)dmstate, true);
         execute_dml_stmt(node);
+    }
 
     /*
      * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2504,8 +2784,8 @@ postgresEndDirectModify(ForeignScanState *node)
         PQclear(dmstate->result);
 
     /* Release remote connection */
-    ReleaseConnection(dmstate->conn);
-    dmstate->conn = NULL;
+    ReleaseConnection(dmstate->s.conn);
+    dmstate->s.conn = NULL;
 
     /* MemoryContext will be deleted automatically. */
 }
@@ -2703,6 +2983,7 @@ estimate_path_cost_size(PlannerInfo *root,
         List       *local_param_join_conds;
         StringInfoData sql;
         PGconn       *conn;
+        PgFdwConnCommonState *commonstate;
         Selectivity local_sel;
         QualCost    local_cost;
         List       *fdw_scan_tlist = NIL;
@@ -2747,6 +3028,18 @@ estimate_path_cost_size(PlannerInfo *root,
 
         /* Get the remote estimate */
         conn = GetConnection(fpinfo->user, false);
+        commonstate = GetConnectionSpecificStorage(fpinfo->user,
+                                                sizeof(PgFdwConnCommonState));
+        if (commonstate)
+        {
+            PgFdwState tmpstate;
+            tmpstate.conn = conn;
+            tmpstate.commonstate = commonstate;
+
+            /* finish running query to send my command */
+            vacate_connection(&tmpstate, true);
+        }
+
         get_remote_estimate(sql.data, conn, &rows, &width,
                             &startup_cost, &total_cost);
         ReleaseConnection(conn);
@@ -3317,11 +3610,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     ExprContext *econtext = node->ss.ps.ps_ExprContext;
     int            numParams = fsstate->numParams;
     const char **values = fsstate->param_values;
-    PGconn       *conn = fsstate->conn;
+    PGconn       *conn = fsstate->s.conn;
     StringInfoData buf;
     PGresult   *res;
 
@@ -3384,50 +3677,120 @@ create_cursor(ForeignScanState *node)
 }
 
 /*
- * Fetch some more rows from the node's cursor.
+ * Sends the next request of the node. If the given node is different from the
+ * current connection leader, pushes it back to waiter queue and let the given
+ * node be the leader.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.commonstate->leader;
+    PGconn       *conn = fsstate->s.conn;
+    char        sql[64];
+
+    /* must be non-busy */
+    Assert(!fsstate->s.commonstate->busy);
+    /* must be not-eof'ed */
+    Assert(!fsstate->eof_reached);
+
+    /*
+     * If this is the first call after Begin or ReScan, we need to create the
+     * cursor on the remote side.
+     */
+    if (!fsstate->cursor_exists)
+        create_cursor(node);
+
+    snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+             fsstate->fetch_size, fsstate->cursor_number);
+
+    if (!PQsendQuery(conn, sql))
+        pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+    fsstate->s.commonstate->busy = true;
+
+    /* The node is the current leader, just return. */
+    if (leader == node)
+        return;
+
+    /* Let the node be the leader */
+    if (leader != NULL)
+    {
+        remove_async_node(node);
+        fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+        fsstate->waiter = leader;
+    }
+    else
+    {
+        fsstate->last_waiter = node;
+        fsstate->waiter = NULL;
+    }
+
+    fsstate->s.commonstate->leader = node;
+}
+
+/*
+ * Fetches received data and automatically send requests of the next waiter.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     PGresult   *volatile res = NULL;
     MemoryContext oldcontext;
+    ForeignScanState *waiter;
+
+    /* I should be the current connection leader */
+    Assert(fsstate->s.commonstate->leader == node);
 
     /*
      * We'll store the tuples in the batch_cxt.  First, flush the previous
-     * batch.
+     * batch if no tuple is remaining
      */
-    fsstate->tuples = NULL;
-    MemoryContextReset(fsstate->batch_cxt);
+    if (fsstate->next_tuple >= fsstate->num_tuples)
+    {
+        fsstate->tuples = NULL;
+        fsstate->num_tuples = 0;
+        MemoryContextReset(fsstate->batch_cxt);
+    }
+    else if (fsstate->next_tuple > 0)
+    {
+        /* There's some remains. Move them to the beginning of the store */
+        int n = 0;
+
+        while(fsstate->next_tuple < fsstate->num_tuples)
+            fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+        fsstate->num_tuples = n;
+    }
+
     oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
     /* PGresult must be released before leaving this function. */
     PG_TRY();
     {
-        PGconn       *conn = fsstate->conn;
+        PGconn       *conn = fsstate->s.conn;
         char        sql[64];
-        int            numrows;
+        int            addrows;
+        size_t        newsize;
         int            i;
 
-        snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
-                 fsstate->fetch_size, fsstate->cursor_number);
-
-        res = pgfdw_exec_query(conn, sql);
-        /* On error, report the original query, not the FETCH. */
+        res = pgfdw_get_result(conn, fsstate->query);
         if (PQresultStatus(res) != PGRES_TUPLES_OK)
             pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
         /* Convert the data into HeapTuples */
-        numrows = PQntuples(res);
-        fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-        fsstate->num_tuples = numrows;
-        fsstate->next_tuple = 0;
+        addrows = PQntuples(res);
+        newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+        if (fsstate->tuples)
+            fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+        else
+            fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-        for (i = 0; i < numrows; i++)
+        for (i = 0; i < addrows; i++)
         {
             Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-            fsstate->tuples[i] =
+            fsstate->tuples[fsstate->num_tuples + i] =
                 make_tuple_from_result_row(res, i,
                                            fsstate->rel,
                                            fsstate->attinmeta,
@@ -3437,22 +3800,73 @@ fetch_more_data(ForeignScanState *node)
         }
 
         /* Update fetch_ct_2 */
-        if (fsstate->fetch_ct_2 < 2)
+        if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
             fsstate->fetch_ct_2++;
 
+        fsstate->next_tuple = 0;
+        fsstate->num_tuples += addrows;
+
         /* Must be EOF if we didn't get as many tuples as we asked for. */
-        fsstate->eof_reached = (numrows < fsstate->fetch_size);
+        fsstate->eof_reached = (addrows < fsstate->fetch_size);
     }
     PG_FINALLY();
     {
+        fsstate->s.commonstate->busy = false;
+
         if (res)
             PQclear(res);
     }
     PG_END_TRY();
 
+    /* let the first waiter be the next leader of this connection */
+    waiter = move_to_next_waiter(node);
+
+    /* send the next request if any */
+    if (waiter)
+        request_more_data(waiter);
+
     MemoryContextSwitchTo(oldcontext);
 }
 
+/*
+ * Vacate the underlying connection so that this node can send the next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+    PgFdwConnCommonState *commonstate = fdwstate->commonstate;
+    ForeignScanState *leader;
+
+    Assert(commonstate != NULL);
+
+    /* just return if the connection is already available */
+    if (commonstate->leader == NULL || !commonstate->busy)
+        return;
+
+    /*
+     * let the current connection leader read all of the result for the running
+     * query
+     */
+    leader = commonstate->leader;
+    fetch_received_data(leader);
+
+    /* let the first waiter be the next leader of this connection */
+    move_to_next_waiter(leader);
+
+    if (!clear_queue)
+        return;
+
+    /* Clear the waiting list */
+    while (leader)
+    {
+        PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+        fsstate->last_waiter = NULL;
+        leader = fsstate->waiter;
+        fsstate->waiter = NULL;
+    }
+}
+
 /*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
@@ -3566,7 +3980,9 @@ create_foreign_modify(EState *estate,
     user = GetUserMapping(userid, table->serverid);
 
     /* Open connection; report that we'll create a prepared statement. */
-    fmstate->conn = GetConnection(user, true);
+    fmstate->s.conn = GetConnection(user, true);
+    fmstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
     fmstate->p_name = NULL;        /* prepared statement not made yet */
 
     /* Set up remote query information. */
@@ -3653,6 +4069,9 @@ execute_foreign_modify(EState *estate,
            operation == CMD_UPDATE ||
            operation == CMD_DELETE);
 
+    /* finish running query to send my command */
+    vacate_connection((PgFdwState *)fmstate, true);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -3680,14 +4099,14 @@ execute_foreign_modify(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3695,10 +4114,10 @@ execute_foreign_modify(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -3734,7 +4153,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
     /* Construct name we'll use for the prepared statement. */
     snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-             GetPrepStmtNumber(fmstate->conn));
+             GetPrepStmtNumber(fmstate->s.conn));
     p_name = pstrdup(prep_name);
 
     /*
@@ -3744,12 +4163,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * the prepared statements we use in this module are simple enough that
      * the remote server will make the right choices.
      */
-    if (!PQsendPrepare(fmstate->conn,
+    if (!PQsendPrepare(fmstate->s.conn,
                        p_name,
                        fmstate->query,
                        0,
                        NULL))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3757,9 +4176,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
     PQclear(res);
 
     /* This action shows that the prepare has been done. */
@@ -3888,16 +4307,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
          * We don't use a PG_TRY block here, so be careful not to throw error
          * without releasing the PGresult.
          */
-        res = pgfdw_exec_query(fmstate->conn, sql);
+        res = pgfdw_exec_query(fmstate->s.conn, sql);
         if (PQresultStatus(res) != PGRES_COMMAND_OK)
-            pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+            pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
         PQclear(res);
         fmstate->p_name = NULL;
     }
 
     /* Release remote connection */
-    ReleaseConnection(fmstate->conn);
-    fmstate->conn = NULL;
+    ReleaseConnection(fmstate->s.conn);
+    fmstate->s.conn = NULL;
 }
 
 /*
@@ -4056,9 +4475,9 @@ execute_dml_stmt(ForeignScanState *node)
      * the desired result.  This allows us to avoid assuming that the remote
      * server has the same OIDs we do for the parameters' types.
      */
-    if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+    if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
                            NULL, values, NULL, NULL, 0))
-        pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+        pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -4066,10 +4485,10 @@ execute_dml_stmt(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+    dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
     if (PQresultStatus(dmstate->result) !=
         (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+        pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
                            dmstate->query);
 
     /* Get the number of rows affected. */
@@ -5560,6 +5979,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
     /* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+    return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add wait event so that the ForeignScan node is going to wait for.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                                  void *caller_data, bool reinit)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+    /* Reinit is not supported for now. */
+    Assert(reinit);
+
+    if (fsstate->s.commonstate->leader == node)
+    {
+        AddWaitEventToSet(wes,
+                          WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+                          NULL, caller_data);
+        return true;
+    }
+
+    return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..96af75a33e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -85,6 +85,7 @@ typedef struct PgFdwRelationInfo
     UserMapping *user;            /* only set in use_remote_estimate mode */
 
     int            fetch_size;        /* fetch size for this remote table */
+    bool        allow_prefetch;    /* true to allow overlapped fetching  */
 
     /*
      * Name of the relation, for use while EXPLAINing ForeignScan.  It is used
@@ -130,6 +131,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 83971665e3..359208a12a 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1780,25 +1780,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1840,12 +1840,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1904,8 +1904,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.18.2


Re: Asynchronous Append on postgres_fdw nodes.

From
Andrey Lepikhov
Date:
On 6/4/20 11:00 AM, Kyotaro Horiguchi wrote:
> Removed a useless variable PgFdwScanState.result_ready.
> Removed duplicate code from remove_async_node() by using move_to_next_waiter().
> Done some minor cleanups.
> 
I am reviewing your code.
A couple of variables are no longer needed (see changes.patch in attachment.

Something about the cost of an asynchronous plan:

At the simple query plan (see below) I see:
1. Startup cost of local SeqScan is equal 0, ForeignScan - 100. But 
startup cost of Append is 0.
2. Total cost of an Append node is a sum of the subplans. Maybe in the 
case of asynchronous append we need to use some reduce factor?

explain select * from parts;

With Async Append:
=====================

  Append  (cost=0.00..2510.30 rows=106780 width=8)
    Async subplans: 3
    ->  Async Foreign Scan on part_1 parts_2  (cost=100.00..177.80 
rows=2260 width=8)
    ->  Async Foreign Scan on part_2 parts_3  (cost=100.00..177.80 
rows=2260 width=8)
    ->  Async Foreign Scan on part_3 parts_4  (cost=100.00..177.80 
rows=2260 width=8)
    ->  Seq Scan on part_0 parts_1  (cost=0.00..1443.00 rows=100000 width=8)

Without Async Append:
=====================

  Append  (cost=0.00..2510.30 rows=106780 width=8)
    ->  Seq Scan on part_0 parts_1  (cost=0.00..1443.00 rows=100000 width=8)
    ->  Foreign Scan on part_1 parts_2  (cost=100.00..177.80 rows=2260 
width=8)
    ->  Foreign Scan on part_2 parts_3  (cost=100.00..177.80 rows=2260 
width=8)
    ->  Foreign Scan on part_3 parts_4  (cost=100.00..177.80 rows=2260 
width=8)

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
Hello, Andrey.

At Tue, 9 Jun 2020 14:20:42 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in 
> On 6/4/20 11:00 AM, Kyotaro Horiguchi wrote:
> > Removed a useless variable PgFdwScanState.result_ready.
> > Removed duplicate code from remove_async_node() by using
> > move_to_next_waiter().
> > Done some minor cleanups.
> > 
> I am reviewing your code.
> A couple of variables are no longer needed (see changes.patch in
> attachment.

Thanks! The recent changes made them useless. Fixed.

> Something about the cost of an asynchronous plan:
> 
> At the simple query plan (see below) I see:
> 1. Startup cost of local SeqScan is equal 0, ForeignScan - 100. But
> startup cost of Append is 0.

The result itself is right that the append doesn't wait for foreign
scans for the first iteration then fetches a tuple from the local
table.  But the estimation is made just by an accident.  If you
defined a foreign table as the first partition, the cost of Append
would be 100, which is rather wrong.

> 2. Total cost of an Append node is a sum of the subplans. Maybe in the
> case of asynchronous append we need to use some reduce factor?

Yes.  For the reason mentioned above, foreign subpaths don't affect
the startup cost of Append as far as any sync subpaths exist.  If no
sync subpaths exist, the Append's startup cost is the minimum startup
cost among the async subpaths.

I fixed cost_append so that it calculates the correct startup
cost. Now the function estimates as follows.

Append (Foreign(100), Foreign(100), Local(0)) => 0;
Append (Local(0), Foreign(100), Foreign(100)) => 0;
Append (Foreign(100), Foreign(100)) => 100;


> explain select * from parts;
> 
> With Async Append:
> =====================
> 
>  Append  (cost=0.00..2510.30 rows=106780 width=8)
>    Async subplans: 3
>    ->  Async Foreign Scan on part_1 parts_2 (cost=100.00..177.80 rows=2260
>    ->  width=8)
>    ->  Async Foreign Scan on part_2 parts_3 (cost=100.00..177.80 rows=2260
>    ->  width=8)
>    ->  Async Foreign Scan on part_3 parts_4 (cost=100.00..177.80 rows=2260
>    ->  width=8)
>    ->  Seq Scan on part_0 parts_1  (cost=0.00..1443.00 rows=100000 width=8)

The SeqScan seems to be the first partition for the parent. It is the
first subnode at cost estimation. The result is right but it comes
from a wrong logic.

> Without Async Append:
> =====================
> 
>  Append  (cost=0.00..2510.30 rows=106780 width=8)
>    ->  Seq Scan on part_0 parts_1  (cost=0.00..1443.00 rows=100000 width=8)
>    ->  Foreign Scan on part_1 parts_2 (cost=100.00..177.80 rows=2260 width=8)
>    ->  Foreign Scan on part_2 parts_3 (cost=100.00..177.80 rows=2260 width=8)
>    ->  Foreign Scan on part_3 parts_4 (cost=100.00..177.80 rows=2260 width=8)

The starup cost of the Append is the cost of the first subnode, that is, 0.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 281ed344f13352dd00bd08926fd8dd12ae182d01 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH v4 1/3] Allow wait event set to be registered to resource
 owner

WaitEventSet needs to be released using resource owner for a certain
case. This change adds WaitEventSet reowner and allow the creator of a
WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 67 +++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 +++
 6 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 7717bb2719..16aefb03ee 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -218,7 +218,7 @@ pq_init(void)
                 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
     AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
                       NULL, NULL);
     AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 91fa4b619b..10d71b46cb 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -56,6 +56,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -84,6 +85,8 @@ struct WaitEventSet
     int            nevents;        /* number of registered events */
     int            nevents_space;    /* maximum number of events in this set */
 
+    ResourceOwner    resowner;    /* Resource owner */
+
     /*
      * Array, of nevents_space length, storing the definition of events this
      * set is waiting for.
@@ -393,7 +396,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
     int            ret = 0;
     int            rc;
     WaitEvent    event;
-    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
     if (wakeEvents & WL_TIMEOUT)
         Assert(timeout >= 0);
@@ -560,12 +563,15 @@ ResetLatch(Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
     WaitEventSet *set;
     char       *data;
     Size        sz = 0;
 
+    if (res)
+        ResourceOwnerEnlargeWESs(res);
+
     /*
      * Use MAXALIGN size/alignment to guarantee that later uses of memory are
      * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -680,6 +686,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
     StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+    /* Register this wait event set if requested */
+    set->resowner = res;
+    if (res)
+        ResourceOwnerRememberWES(set->resowner, set);
+
     return set;
 }
 
@@ -725,6 +736,9 @@ FreeWaitEventSet(WaitEventSet *set)
     }
 #endif
 
+    if (set->resowner != NULL)
+        ResourceOwnerForgetWES(set->resowner, set);
+
     pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 37b6a4eecd..fcc92138fe 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -70,7 +70,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     {
         WaitEventSet *new_event_set;
 
-        new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+        new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
         AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
                           MyLatch, NULL);
         AddWaitEventToSet(new_event_set, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 8bc2c4e9ea..237ca9fa30 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -128,6 +128,7 @@ typedef struct ResourceOwnerData
     ResourceArray filearr;        /* open temporary files */
     ResourceArray dsmarr;        /* dynamic shmem segments */
     ResourceArray jitarr;        /* JIT contexts */
+    ResourceArray wesarr;        /* wait event sets */
 
     /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
     int            nlocks;            /* number of owned locks */
@@ -175,6 +176,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -444,6 +446,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
     ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
     ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
     ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+    ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
     return owner;
 }
@@ -553,6 +556,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 
             jit_release_context(context);
         }
+
+        /* Ditto for wait event sets */
+        while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+        {
+            WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+            if (isCommit)
+                PrintWESLeakWarning(event);
+            FreeWaitEventSet(event);
+        }
     }
     else if (phase == RESOURCE_RELEASE_LOCKS)
     {
@@ -725,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     Assert(owner->filearr.nitems == 0);
     Assert(owner->dsmarr.nitems == 0);
     Assert(owner->jitarr.nitems == 0);
+    Assert(owner->wesarr.nitems == 0);
     Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
     /*
@@ -752,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     ResourceArrayFree(&(owner->filearr));
     ResourceArrayFree(&(owner->dsmarr));
     ResourceArrayFree(&(owner->jitarr));
+    ResourceArrayFree(&(owner->wesarr));
 
     pfree(owner);
 }
@@ -1370,3 +1385,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
         elog(ERROR, "JIT context %p is not owned by resource owner %s",
              DatumGetPointer(handle), owner->name);
 }
+
+/*
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+    ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+    ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identier of a wait event set,
+     * use its pointer instead.
+     */
+    if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+        elog(ERROR, "wait event set %p is not owned by resource owner %s",
+             events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identier of a wait event set,
+     * use its pointer instead.
+     */
+    elog(WARNING, "wait event set leak: %p still referenced",
+         events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 46ae56cae3..b1b8375768 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -163,7 +164,8 @@ extern void DisownLatch(Latch *latch);
 extern void SetLatch(Latch *latch);
 extern void ResetLatch(Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+                                        ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int    AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
                               Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..7d19dadd57 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
 extern void ResourceOwnerForgetJIT(ResourceOwner owner,
                                    Datum handle);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+                         WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+                       WaitEventSet *);
+
 #endif                            /* RESOWNER_PRIVATE_H */
-- 
2.18.2

From 4a172f8009b881ac17ebabcf5f12b0fccca004ba Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 15 May 2018 20:21:32 +0900
Subject: [PATCH v4 2/3] infrastructure for asynchronous execution

This patch add an infrastructure for asynchronous execution. As a PoC
this makes only Append capable to handle asynchronously executable
subnodes.
---
 src/backend/commands/explain.c          |  17 ++
 src/backend/executor/Makefile           |   1 +
 src/backend/executor/execAsync.c        | 152 +++++++++++
 src/backend/executor/nodeAppend.c       | 342 ++++++++++++++++++++----
 src/backend/executor/nodeForeignscan.c  |  21 ++
 src/backend/nodes/bitmapset.c           |  72 +++++
 src/backend/nodes/copyfuncs.c           |   3 +
 src/backend/nodes/outfuncs.c            |   3 +
 src/backend/nodes/readfuncs.c           |   3 +
 src/backend/optimizer/path/allpaths.c   |  24 ++
 src/backend/optimizer/path/costsize.c   |  40 ++-
 src/backend/optimizer/plan/createplan.c |  41 ++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/postmaster/syslogger.c      |   2 +-
 src/backend/utils/adt/ruleutils.c       |   8 +-
 src/backend/utils/resowner/resowner.c   |   4 +-
 src/include/executor/execAsync.h        |  22 ++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 +
 src/include/nodes/bitmapset.h           |   1 +
 src/include/nodes/execnodes.h           |  23 +-
 src/include/nodes/plannodes.h           |   9 +
 src/include/optimizer/paths.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 25 files changed, 740 insertions(+), 71 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9092b4b309..d35de920c8 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -86,6 +86,7 @@ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
                                        List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                                    ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
                           ExplainState *es);
 static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1389,6 +1390,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
         }
         if (plan->parallel_aware)
             appendStringInfoString(es->str, "Parallel ");
+        if (plan->async_capable)
+            appendStringInfoString(es->str, "Async ");
         appendStringInfoString(es->str, pname);
         es->indent++;
     }
@@ -1969,6 +1972,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
         case T_Hash:
             show_hash_info(castNode(HashState, planstate), es);
             break;
+
+        case T_Append:
+            show_append_info(castNode(AppendState, planstate), es);
+            break;
+
         default:
             break;
     }
@@ -2322,6 +2330,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                          ancestors, es);
 }
 
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+    Append *plan = (Append *) astate->ps.plan;
+
+    if (plan->nasyncplans > 0)
+        ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
 /*
  * Show the grouping keys for an Agg node.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
     execAmi.o \
+    execAsync.o \
     execCurrent.o \
     execExpr.o \
     execExprInterp.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..2b7d1877e0
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,152 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *      Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *      src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+/*
+ * ExecAsyncConfigureWait: Add wait event to the WaitEventSet if needed.
+ *
+ * If reinit is true, the caller didn't reuse existing WaitEventSet.
+ */
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                       void *data, bool reinit)
+{
+    switch (nodeTag(node))
+    {
+    case T_ForeignScanState:
+        return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+                                             wes, data, reinit);
+        break;
+    default:
+            elog(ERROR, "unrecognized node type: %d",
+                (int) nodeTag(node));
+    }
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+    int **p_refind;
+    int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+    /* arg is the address of the variable refind in ExecAsyncEventWait */
+    ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+    *mcbarg->p_refind = NULL;
+    *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * ExecAsyncEventWait:
+ *
+ * Wait for async events to fire. Returns the Bitmapset of fired events.
+ */
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+    static int *refind = NULL;
+    static int refindsize = 0;
+    WaitEventSet *wes;
+    WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+    int noccurred = 0;
+    Bitmapset *fired_events = NULL;
+    int i;
+    int n;
+
+    n = bms_num_members(waitnodes);
+    wes = CreateWaitEventSet(TopTransactionContext,
+                             TopTransactionResourceOwner, n);
+    if (refindsize < n)
+    {
+        if (refindsize == 0)
+            refindsize = EVENT_BUFFER_SIZE; /* XXX */
+        while (refindsize < n)
+            refindsize *= 2;
+        if (refind)
+            refind = (int *) repalloc(refind, refindsize * sizeof(int));
+        else
+        {
+            static ExecAsync_mcbarg mcb_arg =
+                { &refind, &refindsize };
+            static MemoryContextCallback mcb =
+                { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+            MemoryContext oldctxt =
+                MemoryContextSwitchTo(TopTransactionContext);
+
+            /*
+             * refind points to a memory block in
+             * TopTransactionContext. Register a callback to reset it.
+             */
+            MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+            refind = (int *) palloc(refindsize * sizeof(int));
+            MemoryContextSwitchTo(oldctxt);
+        }
+    }
+
+    /* Prepare WaitEventSet for waiting on the waitnodes. */
+    n = 0;
+    for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+         i = bms_next_member(waitnodes, i))
+    {
+        refind[i] = i;
+        if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+            n++;
+    }
+
+    /* Return immediately if no node to wait. */
+    if (n == 0)
+    {
+        FreeWaitEventSet(wes);
+        return NULL;
+    }
+
+    noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+                                 EVENT_BUFFER_SIZE,
+                                 WAIT_EVENT_ASYNC_WAIT);
+    FreeWaitEventSet(wes);
+    if (noccurred == 0)
+        return NULL;
+
+    for (i = 0 ; i < noccurred ; i++)
+    {
+        WaitEvent *w = &occurred_event[i];
+
+        if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+        {
+            int n = *(int*)w->user_data;
+
+            fired_events = bms_add_member(fired_events, n);
+        }
+    }
+
+    return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..60c36ee048 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
 #include "executor/execdebug.h"
 #include "executor/execPartition.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -80,6 +81,7 @@ struct ParallelAppendState
 #define INVALID_SUBPLAN_INDEX        -1
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -103,22 +105,22 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     PlanState **appendplanstates;
     Bitmapset  *validsubplans;
     int            nplans;
+    int            nasyncplans;
     int            firstvalid;
     int            i,
                 j;
 
     /* check for unsupported flags */
-    Assert(!(eflags & EXEC_FLAG_MARK));
+    Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
     /*
      * create new AppendState for our append node
      */
     appendstate->ps.plan = (Plan *) node;
     appendstate->ps.state = estate;
-    appendstate->ps.ExecProcNode = ExecAppend;
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+    appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 
     /* If run-time partition pruning is enabled, then set that up now */
     if (node->part_prune_info != NULL)
@@ -152,11 +154,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 
         /*
          * When no run-time pruning is required and there's at least one
-         * subplan, we can fill as_valid_subplans immediately, preventing
+         * subplan, we can fill as_valid_syncsubplans immediately, preventing
          * later calls to ExecFindMatchingSubPlans.
          */
         if (!prunestate->do_exec_prune && nplans > 0)
-            appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
+            appendstate->as_valid_syncsubplans =
+                bms_add_range(NULL, node->nasyncplans, nplans - 1);
     }
     else
     {
@@ -167,8 +170,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
          * subplans as valid; they must also all be initialized.
          */
         Assert(nplans > 0);
-        appendstate->as_valid_subplans = validsubplans =
-            bms_add_range(NULL, 0, nplans - 1);
+        validsubplans =    bms_add_range(NULL, 0, nplans - 1);
+        appendstate->as_valid_syncsubplans =
+            bms_add_range(NULL, node->nasyncplans, nplans - 1);
         appendstate->as_prune_state = NULL;
     }
 
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
      */
     j = 0;
     firstvalid = nplans;
+    nasyncplans = 0;
+
     i = -1;
     while ((i = bms_next_member(validsubplans, i)) >= 0)
     {
         Plan       *initNode = (Plan *) list_nth(node->appendplans, i);
+        int            sub_eflags = eflags;
+
+        /* Let async-capable subplans run asynchronously */
+        if (i < node->nasyncplans)
+        {
+            sub_eflags |= EXEC_FLAG_ASYNC;
+            nasyncplans++;
+        }
 
         /*
          * Record the lowest appendplans index which is a valid partial plan.
@@ -203,13 +217,46 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
         if (i >= node->first_partial_plan && j < firstvalid)
             firstvalid = j;
 
-        appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+        appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
     }
 
     appendstate->as_first_partial_plan = firstvalid;
     appendstate->appendplans = appendplanstates;
     appendstate->as_nplans = nplans;
 
+    /* fill in async stuff */
+    appendstate->as_nasyncplans = nasyncplans;
+    appendstate->as_syncdone = (nasyncplans == nplans);
+    appendstate->as_exec_prune = false;
+
+    /* choose appropriate version of Exec function */
+    if (appendstate->as_nasyncplans == 0)
+        appendstate->ps.ExecProcNode = ExecAppend;
+    else
+        appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+    if (appendstate->as_nasyncplans)
+    {
+        appendstate->as_asyncresult = (TupleTableSlot **)
+            palloc0(appendstate->as_nasyncplans * sizeof(TupleTableSlot *));
+
+        /* initially, all async requests need a request */
+        appendstate->as_needrequest =
+            bms_add_range(NULL, 0, appendstate->as_nasyncplans - 1);
+
+        /*
+         * ExecAppendAsync needs as_valid_syncsubplans to handle async
+         * subnodes.
+         */
+        if (appendstate->as_prune_state != NULL &&
+            appendstate->as_prune_state->do_exec_prune)
+        {
+            Assert(appendstate->as_valid_syncsubplans == NULL);
+
+            appendstate->as_exec_prune = true;
+        }
+    }
+
     /*
      * Miscellaneous initialization
      */
@@ -233,7 +280,7 @@ ExecAppend(PlanState *pstate)
 {
     AppendState *node = castNode(AppendState, pstate);
 
-    if (node->as_whichplan < 0)
+    if (node->as_whichsyncplan < 0)
     {
         /* Nothing to do if there are no subplans */
         if (node->as_nplans == 0)
@@ -243,11 +290,13 @@ ExecAppend(PlanState *pstate)
          * If no subplan has been chosen, we must choose one before
          * proceeding.
          */
-        if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+        if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
             !node->choose_next_subplan(node))
             return ExecClearTuple(node->ps.ps_ResultTupleSlot);
     }
 
+    Assert(node->as_nasyncplans == 0);
+
     for (;;)
     {
         PlanState  *subnode;
@@ -258,8 +307,9 @@ ExecAppend(PlanState *pstate)
         /*
          * figure out which subplan we are currently processing
          */
-        Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-        subnode = node->appendplans[node->as_whichplan];
+        Assert(node->as_whichsyncplan >= 0 &&
+               node->as_whichsyncplan < node->as_nplans);
+        subnode = node->appendplans[node->as_whichsyncplan];
 
         /*
          * get a tuple from the subplan
@@ -282,6 +332,172 @@ ExecAppend(PlanState *pstate)
     }
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+    AppendState *node = castNode(AppendState, pstate);
+    Bitmapset *needrequest;
+    int    i;
+
+    Assert(node->as_nasyncplans > 0);
+
+restart:
+    if (node->as_nasyncresult > 0)
+    {
+        --node->as_nasyncresult;
+        return node->as_asyncresult[node->as_nasyncresult];
+    }
+
+    if (node->as_exec_prune)
+    {
+        Bitmapset *valid_subplans =
+            ExecFindMatchingSubPlans(node->as_prune_state);
+
+        /* Distribute valid subplans into sync and async */
+        node->as_needrequest =
+            bms_intersect(node->as_needrequest, valid_subplans);
+        node->as_valid_syncsubplans =
+            bms_difference(valid_subplans, node->as_needrequest);
+
+        node->as_exec_prune = false;
+    }
+
+    needrequest = node->as_needrequest;
+    node->as_needrequest = NULL;
+    while ((i = bms_first_member(needrequest)) >= 0)
+    {
+        TupleTableSlot *slot;
+        PlanState *subnode = node->appendplans[i];
+
+        slot = ExecProcNode(subnode);
+        if (subnode->asyncstate == AS_AVAILABLE)
+        {
+            if (!TupIsNull(slot))
+            {
+                node->as_asyncresult[node->as_nasyncresult++] = slot;
+                node->as_needrequest = bms_add_member(node->as_needrequest, i);
+            }
+        }
+        else
+            node->as_pending_async = bms_add_member(node->as_pending_async, i);
+    }
+    bms_free(needrequest);
+
+    for (;;)
+    {
+        TupleTableSlot *result;
+
+        /* return now if a result is available */
+        if (node->as_nasyncresult > 0)
+        {
+            --node->as_nasyncresult;
+            return node->as_asyncresult[node->as_nasyncresult];
+        }
+
+        while (!bms_is_empty(node->as_pending_async))
+        {
+            /* Don't wait for async nodes if any sync node exists. */
+            long timeout = node->as_syncdone ? -1 : 0;
+            Bitmapset *fired;
+            int i;
+
+            fired = ExecAsyncEventWait(node->appendplans,
+                                       node->as_pending_async,
+                                       timeout);
+
+            if (bms_is_empty(fired) && node->as_syncdone)
+            {
+                /*
+                 * We come here when all the subnodes had fired before
+                 * waiting. Retry fetching from the nodes.
+                 */
+                node->as_needrequest = node->as_pending_async;
+                node->as_pending_async = NULL;
+                goto restart;
+            }
+
+            while ((i = bms_first_member(fired)) >= 0)
+            {
+                TupleTableSlot *slot;
+                PlanState *subnode = node->appendplans[i];
+                slot = ExecProcNode(subnode);
+
+                Assert(subnode->asyncstate == AS_AVAILABLE);
+
+                if (!TupIsNull(slot))
+                {
+                    node->as_asyncresult[node->as_nasyncresult++] = slot;
+                    node->as_needrequest =
+                        bms_add_member(node->as_needrequest, i);
+                }
+
+                node->as_pending_async =
+                    bms_del_member(node->as_pending_async, i);
+            }
+            bms_free(fired);
+
+            /* return now if a result is available */
+            if (node->as_nasyncresult > 0)
+            {
+                --node->as_nasyncresult;
+                return node->as_asyncresult[node->as_nasyncresult];
+            }
+
+            if (!node->as_syncdone)
+                break;
+        }
+
+        /*
+         * If there is no asynchronous activity still pending and the
+         * synchronous activity is also complete, we're totally done scanning
+         * this node.  Otherwise, we're done with the asynchronous stuff but
+         * must continue scanning the synchronous children.
+         */
+
+        if (!node->as_syncdone &&
+            node->as_whichsyncplan == INVALID_SUBPLAN_INDEX)
+            node->as_syncdone = !node->choose_next_subplan(node);
+
+        if (node->as_syncdone)
+        {
+            Assert(bms_is_empty(node->as_pending_async));
+            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+        }
+
+        /*
+         * get a tuple from the subplan
+         */
+        result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+        if (!TupIsNull(result))
+        {
+            /*
+             * If the subplan gave us something then return it as-is. We do
+             * NOT make use of the result slot that was set up in
+             * ExecInitAppend; there's no need for it.
+             */
+            return result;
+        }
+
+        /*
+         * Go on to the "next" subplan. If no more subplans, return the empty
+         * slot set up for us by ExecInitAppend, unless there are async plans
+         * we have yet to finish.
+         */
+        if (!node->choose_next_subplan(node))
+        {
+            node->as_syncdone = true;
+            if (bms_is_empty(node->as_pending_async))
+            {
+                Assert(bms_is_empty(node->as_needrequest));
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+            }
+        }
+
+        /* Else loop back and try to get a tuple from the new subplan */
+    }
+}
+
 /* ----------------------------------------------------------------
  *        ExecEndAppend
  *
@@ -324,10 +540,18 @@ ExecReScanAppend(AppendState *node)
         bms_overlap(node->ps.chgParam,
                     node->as_prune_state->execparamids))
     {
-        bms_free(node->as_valid_subplans);
-        node->as_valid_subplans = NULL;
+        bms_free(node->as_valid_syncsubplans);
+        node->as_valid_syncsubplans = NULL;
     }
 
+    /* Reset async state. */
+    for (i = 0; i < node->as_nasyncplans; ++i)
+        ExecShutdownNode(node->appendplans[i]);
+
+    node->as_nasyncresult = 0;
+    node->as_needrequest = bms_add_range(NULL, 0, node->as_nasyncplans - 1);
+    node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
     for (i = 0; i < node->as_nplans; i++)
     {
         PlanState  *subnode = node->appendplans[i];
@@ -348,7 +572,7 @@ ExecReScanAppend(AppendState *node)
     }
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    node->as_whichplan = INVALID_SUBPLAN_INDEX;
+    node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 }
 
 /* ----------------------------------------------------------------
@@ -436,7 +660,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-    int            whichplan = node->as_whichplan;
+    int            whichplan = node->as_whichsyncplan;
     int            nextplan;
 
     /* We should never be called when there are no subplans */
@@ -451,10 +675,18 @@ choose_next_subplan_locally(AppendState *node)
      */
     if (whichplan == INVALID_SUBPLAN_INDEX)
     {
-        if (node->as_valid_subplans == NULL)
-            node->as_valid_subplans =
+        /* Shouldn't have an active async node */
+        Assert(bms_is_empty(node->as_needrequest));
+
+        if (node->as_valid_syncsubplans == NULL)
+            node->as_valid_syncsubplans =
                 ExecFindMatchingSubPlans(node->as_prune_state);
 
+        /* Exclude async plans */
+        if (node->as_nasyncplans > 0)
+            bms_del_range(node->as_valid_syncsubplans,
+                          0, node->as_nasyncplans - 1);
+
         whichplan = -1;
     }
 
@@ -462,14 +694,14 @@ choose_next_subplan_locally(AppendState *node)
     Assert(whichplan >= -1 && whichplan <= node->as_nplans);
 
     if (ScanDirectionIsForward(node->ps.state->es_direction))
-        nextplan = bms_next_member(node->as_valid_subplans, whichplan);
+        nextplan = bms_next_member(node->as_valid_syncsubplans, whichplan);
     else
-        nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
+        nextplan = bms_prev_member(node->as_valid_syncsubplans, whichplan);
 
     if (nextplan < 0)
         return false;
 
-    node->as_whichplan = nextplan;
+    node->as_whichsyncplan = nextplan;
 
     return true;
 }
@@ -490,29 +722,29 @@ choose_next_subplan_for_leader(AppendState *node)
     /* Backward scan is not supported by parallel-aware plans */
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
-    /* We should never be called when there are no subplans */
-    Assert(node->as_nplans > 0);
+    /* We should never be called when there are no sync subplans */
+    Assert(node->as_nplans > node->as_nasyncplans);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
     {
         /* Mark just-completed subplan as finished. */
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
     }
     else
     {
         /* Start with last subplan. */
-        node->as_whichplan = node->as_nplans - 1;
+        node->as_whichsyncplan = node->as_nplans - 1;
 
         /*
          * If we've yet to determine the valid subplans then do so now.  If
          * run-time pruning is disabled then the valid subplans will always be
          * set to all subplans.
          */
-        if (node->as_valid_subplans == NULL)
+        if (node->as_valid_syncsubplans == NULL)
         {
-            node->as_valid_subplans =
+            node->as_valid_syncsubplans =
                 ExecFindMatchingSubPlans(node->as_prune_state);
 
             /*
@@ -524,26 +756,26 @@ choose_next_subplan_for_leader(AppendState *node)
     }
 
     /* Loop until we find a subplan to execute. */
-    while (pstate->pa_finished[node->as_whichplan])
+    while (pstate->pa_finished[node->as_whichsyncplan])
     {
-        if (node->as_whichplan == 0)
+        if (node->as_whichsyncplan == 0)
         {
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-            node->as_whichplan = INVALID_SUBPLAN_INDEX;
+            node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
             LWLockRelease(&pstate->pa_lock);
             return false;
         }
 
         /*
-         * We needn't pay attention to as_valid_subplans here as all invalid
+         * We needn't pay attention to as_valid_syncsubplans here as all invalid
          * plans have been marked as finished.
          */
-        node->as_whichplan--;
+        node->as_whichsyncplan--;
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -571,23 +803,23 @@ choose_next_subplan_for_worker(AppendState *node)
     /* Backward scan is not supported by parallel-aware plans */
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
-    /* We should never be called when there are no subplans */
-    Assert(node->as_nplans > 0);
+    /* We should never be called when there are no sync subplans */
+    Assert(node->as_nplans > node->as_nasyncplans);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
     /* Mark just-completed subplan as finished. */
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     /*
      * If we've yet to determine the valid subplans then do so now.  If
      * run-time pruning is disabled then the valid subplans will always be set
      * to all subplans.
      */
-    else if (node->as_valid_subplans == NULL)
+    else if (node->as_valid_syncsubplans == NULL)
     {
-        node->as_valid_subplans =
+        node->as_valid_syncsubplans =
             ExecFindMatchingSubPlans(node->as_prune_state);
         mark_invalid_subplans_as_finished(node);
     }
@@ -600,30 +832,30 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Save the plan from which we are starting the search. */
-    node->as_whichplan = pstate->pa_next_plan;
+    node->as_whichsyncplan = pstate->pa_next_plan;
 
     /* Loop until we find a valid subplan to execute. */
     while (pstate->pa_finished[pstate->pa_next_plan])
     {
         int            nextplan;
 
-        nextplan = bms_next_member(node->as_valid_subplans,
+        nextplan = bms_next_member(node->as_valid_syncsubplans,
                                    pstate->pa_next_plan);
         if (nextplan >= 0)
         {
             /* Advance to the next valid plan. */
             pstate->pa_next_plan = nextplan;
         }
-        else if (node->as_whichplan > node->as_first_partial_plan)
+        else if (node->as_whichsyncplan > node->as_first_partial_plan)
         {
             /*
              * Try looping back to the first valid partial plan, if there is
              * one.  If there isn't, arrange to bail out below.
              */
-            nextplan = bms_next_member(node->as_valid_subplans,
+            nextplan = bms_next_member(node->as_valid_syncsubplans,
                                        node->as_first_partial_plan - 1);
             pstate->pa_next_plan =
-                nextplan < 0 ? node->as_whichplan : nextplan;
+                nextplan < 0 ? node->as_whichsyncplan : nextplan;
         }
         else
         {
@@ -631,10 +863,10 @@ choose_next_subplan_for_worker(AppendState *node)
              * At last plan, and either there are no partial plans or we've
              * tried them all.  Arrange to bail out.
              */
-            pstate->pa_next_plan = node->as_whichplan;
+            pstate->pa_next_plan = node->as_whichsyncplan;
         }
 
-        if (pstate->pa_next_plan == node->as_whichplan)
+        if (pstate->pa_next_plan == node->as_whichsyncplan)
         {
             /* We've tried everything! */
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -644,8 +876,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Pick the plan we found, and advance pa_next_plan one more time. */
-    node->as_whichplan = pstate->pa_next_plan;
-    pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
+    node->as_whichsyncplan = pstate->pa_next_plan;
+    pstate->pa_next_plan = bms_next_member(node->as_valid_syncsubplans,
                                            pstate->pa_next_plan);
 
     /*
@@ -654,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node)
      */
     if (pstate->pa_next_plan < 0)
     {
-        int            nextplan = bms_next_member(node->as_valid_subplans,
+        int            nextplan = bms_next_member(node->as_valid_syncsubplans,
                                                node->as_first_partial_plan - 1);
 
         if (nextplan >= 0)
@@ -671,8 +903,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -699,13 +931,13 @@ mark_invalid_subplans_as_finished(AppendState *node)
     Assert(node->as_prune_state);
 
     /* Nothing to do if all plans are valid */
-    if (bms_num_members(node->as_valid_subplans) == node->as_nplans)
+    if (bms_num_members(node->as_valid_syncsubplans) == node->as_nplans)
         return;
 
     /* Mark all non-valid plans as finished */
     for (i = 0; i < node->as_nplans; i++)
     {
-        if (!bms_is_member(i, node->as_valid_subplans))
+        if (!bms_is_member(i, node->as_valid_syncsubplans))
             node->as_pstate->pa_finished[i] = true;
     }
 }
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 513471ab9b..3bf4aaa63d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -141,6 +141,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
     scanstate->ss.ps.plan = (Plan *) node;
     scanstate->ss.ps.state = estate;
     scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+    scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+    if ((eflags & EXEC_FLAG_ASYNC) != 0)
+        scanstate->fs_async = true;
 
     /*
      * Miscellaneous initialization
@@ -384,3 +388,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
     if (fdwroutine->ShutdownForeignScan)
         fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *        ExecAsyncForeignScanConfigureWait
+ *
+ *        In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                              void *caller_data, bool reinit)
+{
+    FdwRoutine *fdwroutine = node->fdwroutine;
+
+    Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+    return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+                                                 caller_data, reinit);
+}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 2719ea45a3..05b625783b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -895,6 +895,78 @@ bms_add_range(Bitmapset *a, int lower, int upper)
     return a;
 }
 
+/*
+ * bms_del_range
+ *        Delete members in the range of 'lower' to 'upper' from the set.
+ *
+ * Note this could also be done by calling bms_del_member in a loop, however,
+ * using this function will be faster when the range is large as we work at
+ * the bitmapword level rather than at bit level.
+ */
+Bitmapset *
+bms_del_range(Bitmapset *a, int lower, int upper)
+{
+    int            lwordnum,
+                lbitnum,
+                uwordnum,
+                ushiftbits,
+                wordnum;
+
+    if (lower < 0 || upper < 0)
+        elog(ERROR, "negative bitmapset member not allowed");
+    if (lower > upper)
+        elog(ERROR, "lower range must not be above upper range");
+    uwordnum = WORDNUM(upper);
+
+    if (a == NULL)
+    {
+        a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1));
+        a->nwords = uwordnum + 1;
+    }
+
+    /* ensure we have enough words to store the upper bit */
+    else if (uwordnum >= a->nwords)
+    {
+        int            oldnwords = a->nwords;
+        int            i;
+
+        a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1));
+        a->nwords = uwordnum + 1;
+        /* zero out the enlarged portion */
+        for (i = oldnwords; i < a->nwords; i++)
+            a->words[i] = 0;
+    }
+
+    wordnum = lwordnum = WORDNUM(lower);
+
+    lbitnum = BITNUM(lower);
+    ushiftbits = BITNUM(upper) + 1;
+
+    /*
+     * Special case when lwordnum is the same as uwordnum we must perform the
+     * upper and lower masking on the word.
+     */
+    if (lwordnum == uwordnum)
+    {
+        a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1)
+                                | (~(bitmapword) 0) << ushiftbits);
+    }
+    else
+    {
+        /* turn off lbitnum and all bits left of it */
+        a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1);
+
+        /* turn off all bits for any intermediate words */
+        while (wordnum < uwordnum)
+            a->words[wordnum++] = (bitmapword) 0;
+
+        /* turn off upper's bit and all bits right of it. */
+        a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits;
+    }
+
+    return a;
+}
+
 /*
  * bms_int_members - like bms_intersect, but left input is recycled
  */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d8cf87e6d0..89a49e2fdc 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -121,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
     COPY_SCALAR_FIELD(plan_width);
     COPY_SCALAR_FIELD(parallel_aware);
     COPY_SCALAR_FIELD(parallel_safe);
+    COPY_SCALAR_FIELD(async_capable);
     COPY_SCALAR_FIELD(plan_node_id);
     COPY_NODE_FIELD(targetlist);
     COPY_NODE_FIELD(qual);
@@ -246,6 +247,8 @@ _copyAppend(const Append *from)
     COPY_NODE_FIELD(appendplans);
     COPY_SCALAR_FIELD(first_partial_plan);
     COPY_NODE_FIELD(part_prune_info);
+    COPY_SCALAR_FIELD(nasyncplans);
+    COPY_SCALAR_FIELD(referent);
 
     return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515d..d4bb44b268 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -334,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
     WRITE_INT_FIELD(plan_width);
     WRITE_BOOL_FIELD(parallel_aware);
     WRITE_BOOL_FIELD(parallel_safe);
+    WRITE_BOOL_FIELD(async_capable);
     WRITE_INT_FIELD(plan_node_id);
     WRITE_NODE_FIELD(targetlist);
     WRITE_NODE_FIELD(qual);
@@ -436,6 +437,8 @@ _outAppend(StringInfo str, const Append *node)
     WRITE_NODE_FIELD(appendplans);
     WRITE_INT_FIELD(first_partial_plan);
     WRITE_NODE_FIELD(part_prune_info);
+    WRITE_INT_FIELD(nasyncplans);
+    WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab719..63af7c02d8 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1572,6 +1572,7 @@ ReadCommonPlan(Plan *local_node)
     READ_INT_FIELD(plan_width);
     READ_BOOL_FIELD(parallel_aware);
     READ_BOOL_FIELD(parallel_safe);
+    READ_BOOL_FIELD(async_capable);
     READ_INT_FIELD(plan_node_id);
     READ_NODE_FIELD(targetlist);
     READ_NODE_FIELD(qual);
@@ -1672,6 +1673,8 @@ _readAppend(void)
     READ_NODE_FIELD(appendplans);
     READ_INT_FIELD(first_partial_plan);
     READ_NODE_FIELD(part_prune_info);
+    READ_INT_FIELD(nasyncplans);
+    READ_INT_FIELD(referent);
 
     READ_DONE();
 }
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index d984da25d7..bb4c8723bc 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3937,6 +3937,30 @@ generate_partitionwise_join_paths(PlannerInfo *root, RelOptInfo *rel)
     list_free(live_children);
 }
 
+/*
+ * is_projection_capable_path
+ *        Check whether a given Path node is async-capable.
+ */
+bool
+is_async_capable_path(Path *path)
+{
+    switch (nodeTag(path))
+    {
+        case T_ForeignPath:
+            {
+                FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+                Assert(fdwroutine != NULL);
+                if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+                    fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+                    return true;
+            }
+        default:
+            break;
+    }
+    return false;
+}
+
 
 /*****************************************************************************
  *            DEBUG SUPPORT
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b976afb69d..c4c83b5887 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2050,13 +2050,9 @@ cost_append(AppendPath *apath)
 
         if (pathkeys == NIL)
         {
-            Path       *subpath = (Path *) linitial(apath->subpaths);
+            Cost        async_min_startup_cost = -1.0;
 
-            /*
-             * For an unordered, non-parallel-aware Append we take the startup
-             * cost as the startup cost of the first subpath.
-             */
-            apath->path.startup_cost = subpath->startup_cost;
+            apath->path.startup_cost = -1.0;
 
             /* Compute rows and costs as sums of subplan rows and costs. */
             foreach(l, apath->subpaths)
@@ -2065,6 +2061,38 @@ cost_append(AppendPath *apath)
 
                 apath->path.rows += subpath->rows;
                 apath->path.total_cost += subpath->total_cost;
+
+                if (!is_async_capable_path(subpath))
+                {
+                    /*
+                     * For an unordered, non-parallel-aware Append we take the
+                     * startup cost as the startup cost of the first subpath.
+                     */
+                    if (apath->path.startup_cost < 0)
+                        apath->path.startup_cost = subpath->startup_cost;
+                }
+                else if (apath->path.startup_cost < 0)
+                {
+                    /*
+                     * Usually async-capable paths don't affect the startup
+                     * cost of Append.  However, if no sync paths are
+                     * contained, the startup cost of the Append is the minimal
+                     * async startup cost. Remember the cost just in case.
+                     */
+                    if (async_min_startup_cost < 0 ||
+                        async_min_startup_cost > subpath->startup_cost)
+                        async_min_startup_cost = subpath->startup_cost;
+                }
+            }
+
+            /*
+             * Use the minimum async startup cost if no sync startup has been
+             * found.
+             */
+            if (apath->path.startup_cost < 0)
+            {
+                Assert(async_min_startup_cost >= 0);
+                apath->path.startup_cost = async_min_startup_cost;
             }
         }
         else
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index eb9543f6ad..7f75708134 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1082,6 +1082,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     bool        tlist_was_changed = false;
     List       *pathkeys = best_path->path.pathkeys;
     List       *subplans = NIL;
+    List       *asyncplans = NIL;
+    List       *syncplans = NIL;
+    List       *asyncpaths = NIL;
+    List       *syncpaths = NIL;
+    List       *newsubpaths = NIL;
     ListCell   *subpaths;
     RelOptInfo *rel = best_path->path.parent;
     PartitionPruneInfo *partpruneinfo = NULL;
@@ -1090,6 +1095,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     Oid           *nodeSortOperators = NULL;
     Oid           *nodeCollations = NULL;
     bool       *nodeNullsFirst = NULL;
+    int            nasyncplans = 0;
+    bool        first = true;
+    bool        referent_is_sync = true;
 
     /*
      * The subpaths list could be empty, if every child was proven empty by
@@ -1219,9 +1227,36 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
             }
         }
 
-        subplans = lappend(subplans, subplan);
+        /*
+         * Classify as async-capable or not. If we have decided to run the
+         * children in parallel, we cannot any one of them run asynchronously.
+         */
+        if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+        {
+            subplan->async_capable = true;
+            asyncplans = lappend(asyncplans, subplan);
+            asyncpaths = lappend(asyncpaths, subpath);
+            ++nasyncplans;
+            if (first)
+                referent_is_sync = false;
+        }
+        else
+        {
+            syncplans = lappend(syncplans, subplan);
+            syncpaths = lappend(syncpaths, subpath);
+        }
+
+        first = false;
     }
 
+    /*
+     * subplan contains asyncplans in the first half, if any, and sync plans in
+     * another half, if any.  We need that the same for subpaths to make
+     * partition pruning information in sync with subplans.
+     */
+    subplans = list_concat(asyncplans, syncplans);
+    newsubpaths = list_concat(asyncpaths, syncpaths);
+
     /*
      * If any quals exist, they may be useful to perform further partition
      * pruning during execution.  Gather information needed by the executor to
@@ -1249,7 +1284,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
         if (prunequal != NIL)
             partpruneinfo =
                 make_partition_pruneinfo(root, rel,
-                                         best_path->subpaths,
+                                         newsubpaths,
                                          best_path->partitioned_rels,
                                          prunequal);
     }
@@ -1257,6 +1292,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     plan->appendplans = subplans;
     plan->first_partial_plan = best_path->first_partial_path;
     plan->part_prune_info = partpruneinfo;
+    plan->nasyncplans = nasyncplans;
+    plan->referent = referent_is_sync ? nasyncplans : 0;
 
     copy_generic_path_info(&plan->plan, (Path *) best_path);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 309378ae54..d9ea75d823 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3882,6 +3882,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_XACT_GROUP_UPDATE:
             event_name = "XactGroupUpdate";
             break;
+        case WAIT_EVENT_ASYNC_WAIT:
+            event_name = "AsyncExecWait";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index ffcb54968f..a4de6d90e2 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -300,7 +300,7 @@ SysLoggerMain(int argc, char *argv[])
      * syslog pipe, which implies that all other backends have exited
      * (including the postmaster).
      */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 2);
+    wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 2);
     AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
 #ifndef WIN32
     AddWaitEventToSet(wes, WL_SOCKET_READABLE, syslogPipe[0], NULL, NULL);
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 076c3c019f..f7b5587d7f 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4584,10 +4584,14 @@ set_deparse_plan(deparse_namespace *dpns, Plan *plan)
      * tlists according to one of the children, and the first one is the most
      * natural choice.  Likewise special-case ModifyTable to pretend that the
      * first child plan is the OUTER referent; this is to support RETURNING
-     * lists containing references to non-target relations.
+     * lists containing references to non-target relations. For Append, use the
+     * explicitly specified referent.
      */
     if (IsA(plan, Append))
-        dpns->outer_plan = linitial(((Append *) plan)->appendplans);
+    {
+        Append *app = (Append *) plan;
+        dpns->outer_plan = list_nth(app->appendplans, app->referent);
+    }
     else if (IsA(plan, MergeAppend))
         dpns->outer_plan = linitial(((MergeAppend *) plan)->mergeplans);
     else if (IsA(plan, ModifyTable))
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 237ca9fa30..27742a1641 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -1416,7 +1416,7 @@ void
 ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
 {
     /*
-     * XXXX: There's no property to show as an identier of a wait event set,
+     * XXXX: There's no property to show as an identifier of a wait event set,
      * use its pointer instead.
      */
     if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
@@ -1431,7 +1431,7 @@ static void
 PrintWESLeakWarning(WaitEventSet *events)
 {
     /*
-     * XXXX: There's no property to show as an identier of a wait event set,
+     * XXXX: There's no property to show as an identifier of a wait event set,
      * use its pointer instead.
      */
     elog(WARNING, "wait event set leak: %p still referenced",
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..3b6bf4a516
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,22 @@
+/*--------------------------------------------------------------------
+ * execAsync.c
+ *        Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *        src/backend/executor/execAsync.c
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                                   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+                                     long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c7deeac662..aca9e2bddd 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -59,6 +59,7 @@
 #define EXEC_FLAG_MARK            0x0008    /* need mark/restore */
 #define EXEC_FLAG_SKIP_TRIGGERS 0x0010    /* skip AfterTrigger calls */
 #define EXEC_FLAG_WITH_NO_DATA    0x0020    /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC            0x0040    /* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..71a233b41f 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
                                             ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+                                          WaitEventSet *wes,
+                                          void *caller_data, bool reinit);
 
 #endif                            /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..853ba2b5ad 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -169,6 +169,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
                                                             List *fdw_private,
                                                             RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+                                                    WaitEventSet *wes,
+                                                    void *caller_data,
+                                                    bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -190,6 +195,7 @@ typedef struct FdwRoutine
     GetForeignPlan_function GetForeignPlan;
     BeginForeignScan_function BeginForeignScan;
     IterateForeignScan_function IterateForeignScan;
+    IterateForeignScan_function IterateForeignScanAsync;
     ReScanForeignScan_function ReScanForeignScan;
     EndForeignScan_function EndForeignScan;
 
@@ -242,6 +248,11 @@ typedef struct FdwRoutine
     InitializeDSMForeignScan_function InitializeDSMForeignScan;
     ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
     InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+    /* Support functions for asynchronous execution */
+    IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+    ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
     ShutdownForeignScan_function ShutdownForeignScan;
 
     /* Support functions for path reparameterization. */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index d113c271ee..177e6218cb 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -107,6 +107,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
+extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
 
 /* support for iterating through the integer elements of a set: */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 98e0072b8a..cd50494c74 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -938,6 +938,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+    AS_AVAILABLE,
+    AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
     NodeTag        type;
@@ -1026,6 +1032,11 @@ typedef struct PlanState
     bool        outeropsset;
     bool        inneropsset;
     bool        resultopsset;
+
+    /* Async subnode execution stuff */
+    AsyncState    asyncstate;
+
+    int32        padding;            /* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1221,14 +1232,21 @@ struct AppendState
     PlanState    ps;                /* its first field is NodeTag */
     PlanState **appendplans;    /* array of PlanStates for my inputs */
     int            as_nplans;
-    int            as_whichplan;
+    int            as_whichsyncplan; /* which sync plan is being executed  */
     int            as_first_partial_plan;    /* Index of 'appendplans' containing
                                          * the first partial plan */
+    int            as_nasyncplans;    /* # of async-capable children */
     ParallelAppendState *as_pstate; /* parallel coordination info */
     Size        pstate_len;        /* size of parallel coordination info */
     struct PartitionPruneState *as_prune_state;
-    Bitmapset  *as_valid_subplans;
+    Bitmapset  *as_valid_syncsubplans;
     bool        (*choose_next_subplan) (AppendState *);
+    bool        as_syncdone;    /* all synchronous plans done? */
+    Bitmapset  *as_needrequest;    /* async plans needing a new request */
+    Bitmapset  *as_pending_async;    /* pending async plans */
+    TupleTableSlot **as_asyncresult;    /* results of each async plan */
+    int            as_nasyncresult;    /* # of valid entries in as_asyncresult */
+    bool        as_exec_prune;    /* runtime pruning needed for async exec? */
 };
 
 /* ----------------
@@ -1796,6 +1814,7 @@ typedef struct ForeignScanState
     Size        pscan_len;        /* size of parallel coordination information */
     /* use struct pointer to avoid including fdwapi.h here */
     struct FdwRoutine *fdwroutine;
+    bool        fs_async;
     void       *fdw_state;        /* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 83e01074ed..abad89b327 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -135,6 +135,11 @@ typedef struct Plan
     bool        parallel_aware; /* engage parallel-aware logic? */
     bool        parallel_safe;    /* OK to use as part of parallel plan? */
 
+    /*
+     * information needed for asynchronous execution
+     */
+    bool        async_capable;  /* engage asynchronous execution logic? */
+
     /*
      * Common structural data for all Plan types.
      */
@@ -262,6 +267,10 @@ typedef struct Append
 
     /* Info for run-time subplan pruning; NULL if we're not doing that */
     struct PartitionPruneInfo *part_prune_info;
+
+    /* Async child node execution stuff */
+    int            nasyncplans;    /* # async subplans, always at start of list */
+    int            referent;         /* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 10b6e81079..53876b2d8b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -241,4 +241,6 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
 extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
                                     List *live_childrels);
 
+extern bool is_async_capable_path(Path *path);
+
 #endif                            /* PATHS_H */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..2259910637 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -887,7 +887,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
     WAIT_EVENT_SYNC_REP,
-    WAIT_EVENT_XACT_GROUP_UPDATE
+    WAIT_EVENT_XACT_GROUP_UPDATE,
+    WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.18.2

From 5c64d2d3315d7e38676f349c10f94f445da2da58 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH v4 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c             |  28 +
 .../postgres_fdw/expected/postgres_fdw.out    | 222 ++++---
 contrib/postgres_fdw/postgres_fdw.c           | 601 +++++++++++++++---
 contrib/postgres_fdw/postgres_fdw.h           |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql     |  20 +-
 5 files changed, 691 insertions(+), 182 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 52d1fe3563..d9edc5e4de 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
     bool        invalidated;    /* true if reconnect is pending */
     uint32        server_hashvalue;    /* hash value of foreign server OID */
     uint32        mapping_hashvalue;    /* hash value of user mapping OID */
+    void        *storage;        /* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
         elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
              entry->conn, server->servername, user->umid, user->userid);
+        entry->storage = NULL;
     }
 
     /*
@@ -215,6 +217,32 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
     return entry->conn;
 }
 
+/*
+ * Returns the connection specific storage for this user. Allocate with
+ * initsize if not exists.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+    bool        found;
+    ConnCacheEntry *entry;
+    ConnCacheKey key;
+
+    /* Find storage using the same key with GetConnection */
+    key = user->umid;
+    entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+    Assert(found);
+
+    /* Create one if not yet. */
+    if (entry->storage == NULL)
+    {
+        entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+        memset(entry->storage, 0, initsize);
+    }
+
+    return entry->storage;
+}
+
 /*
  * Connect to remote server using specified server and user mapping properties.
  */
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 82fc1290ef..29aa09db8e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6973,7 +6973,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -7001,7 +7001,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -7029,7 +7029,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -7057,7 +7057,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -7127,35 +7127,41 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar bar_1
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+                     Sort Key: bar_1.f1
+                     ->  Seq Scan on public.bar bar_1
+                           Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
                ->  Foreign Scan on public.bar2 bar_2
                      Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -7165,35 +7171,41 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar bar_1
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+                     Sort Key: bar_1.f1
+                     ->  Seq Scan on public.bar bar_1
+                           Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
                ->  Foreign Scan on public.bar2 bar_2
                      Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -7223,11 +7235,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
    ->  Hash Join
          Output: bar_1.f1, (bar_1.f2 + 100), bar_1.f3, bar_1.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -7241,12 +7254,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(41 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
 select tableoid::regclass, * from bar order by 1,2;
@@ -7276,16 +7290,17 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
+               Async subplans: 2 
+               ->  Async Foreign Scan on public.foo2 foo_1
+                     Output: ROW(foo_1.f1), foo_1.f1
+                     Remote SQL: SELECT f1 FROM public.loct1
+               ->  Async Foreign Scan on public.foo2 foo_3
+                     Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+                     Remote SQL: SELECT f1 FROM public.loct1
                ->  Seq Scan on public.foo
                      Output: ROW(foo.f1), foo.f1
-               ->  Foreign Scan on public.foo2 foo_1
-                     Output: ROW(foo_1.f1), foo_1.f1
-                     Remote SQL: SELECT f1 FROM public.loct1
                ->  Seq Scan on public.foo foo_2
                      Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
-               ->  Foreign Scan on public.foo2 foo_3
-                     Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
-                     Remote SQL: SELECT f1 FROM public.loct1
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -7303,17 +7318,18 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on public.foo2 foo_1
+                           Output: ROW(foo_1.f1), foo_1.f1
+                           Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Async Foreign Scan on public.foo2 foo_3
+                           Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+                           Remote SQL: SELECT f1 FROM public.loct1
                      ->  Seq Scan on public.foo
                            Output: ROW(foo.f1), foo.f1
-                     ->  Foreign Scan on public.foo2 foo_1
-                           Output: ROW(foo_1.f1), foo_1.f1
-                           Remote SQL: SELECT f1 FROM public.loct1
                      ->  Seq Scan on public.foo foo_2
                            Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
-                     ->  Foreign Scan on public.foo2 foo_3
-                           Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
-                           Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+(47 rows)
 
 update bar set f2 = f2 + 100
 from
@@ -7463,27 +7479,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2 bar_1
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2 bar_1
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2 bar_1
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2 bar_1
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
@@ -8558,11 +8580,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
  Sort
    Sort Key: t1.a, t3.c
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE
t1.a% 25 =0 ORDER BY 1,2,3;
 
   a  |  b  |  c   
@@ -8597,20 +8620,22 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
 -- with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2
t2WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
 
-                       QUERY PLAN                       
---------------------------------------------------------
+                          QUERY PLAN                          
+--------------------------------------------------------------
  Sort
    Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
    ->  Hash Full Join
          Hash Cond: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
-               ->  Foreign Scan on ftprt1_p2 t1_2
+               Async subplans: 2 
+               ->  Async Foreign Scan on ftprt1_p1 t1_1
+               ->  Async Foreign Scan on ftprt1_p2 t1_2
          ->  Hash
                ->  Append
-                     ->  Foreign Scan on ftprt2_p1 t2_1
-                     ->  Foreign Scan on ftprt2_p2 t2_2
-(11 rows)
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on ftprt2_p1 t2_1
+                     ->  Async Foreign Scan on ftprt2_p2 t2_2
+(13 rows)
 
 SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2
t2WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
 
        wr       |       wr       
@@ -8639,11 +8664,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
  Sort
    Sort Key: t1.a, t1.b
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE
t1.a%25= 0 ORDER BY 1,2;
 
   a  |  b  
@@ -8696,21 +8722,23 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
 -- test FOR UPDATE; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF
t1;
-                          QUERY PLAN                          
---------------------------------------------------------------
+                             QUERY PLAN                             
+--------------------------------------------------------------------
  LockRows
    ->  Sort
          Sort Key: t1.a
          ->  Hash Join
                Hash Cond: (t2.b = t1.a)
                ->  Append
-                     ->  Foreign Scan on ftprt2_p1 t2_1
-                     ->  Foreign Scan on ftprt2_p2 t2_2
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on ftprt2_p1 t2_1
+                     ->  Async Foreign Scan on ftprt2_p2 t2_2
                ->  Hash
                      ->  Append
-                           ->  Foreign Scan on ftprt1_p1 t1_1
-                           ->  Foreign Scan on ftprt1_p2 t1_2
-(12 rows)
+                           Async subplans: 2 
+                           ->  Async Foreign Scan on ftprt1_p1 t1_1
+                           ->  Async Foreign Scan on ftprt1_p2 t1_2
+(14 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF
t1;
   a  |  b  
@@ -8745,18 +8773,19 @@ ANALYZE fpagg_tab_p3;
 SET enable_partitionwise_aggregate TO false;
 EXPLAIN (COSTS OFF)
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
-                        QUERY PLAN                         
------------------------------------------------------------
+                           QUERY PLAN                            
+-----------------------------------------------------------------
  Sort
    Sort Key: pagg_tab.a
    ->  HashAggregate
          Group Key: pagg_tab.a
          Filter: (avg(pagg_tab.b) < '22'::numeric)
          ->  Append
-               ->  Foreign Scan on fpagg_tab_p1 pagg_tab_1
-               ->  Foreign Scan on fpagg_tab_p2 pagg_tab_2
-               ->  Foreign Scan on fpagg_tab_p3 pagg_tab_3
-(9 rows)
+               Async subplans: 3 
+               ->  Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+               ->  Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+               ->  Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
 
 -- Plan with partitionwise aggregates is enabled
 SET enable_partitionwise_aggregate TO true;
@@ -8767,13 +8796,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
  Sort
    Sort Key: pagg_tab.a
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 3 
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
-(9 rows)
+(10 rows)
 
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | sum  | min | count 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 9fc53cad68..4bfc2d39ea 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -35,6 +37,7 @@
 #include "optimizer/restrictinfo.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "postgres_fdw.h"
 #include "utils/builtins.h"
 #include "utils/float.h"
@@ -56,6 +59,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -122,11 +128,29 @@ enum FdwDirectModifyPrivateIndex
     FdwDirectModifyPrivateSetProcessed
 };
 
+/*
+ * Connection common state - shared among all PgFdwState instances using the
+ * same connection.
+ */
+typedef struct PgFdwConnCommonState
+{
+    ForeignScanState   *leader;        /* leader node of this connection */
+    bool                busy;        /* true if this connection is busy */
+} PgFdwConnCommonState;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+    PGconn       *conn;                    /* connection for the scan */
+    PgFdwConnCommonState *commonstate;    /* connection common state */
+} PgFdwState;
+
 /*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table. NULL
                                  * for a foreign join scan. */
     TupleDesc    tupdesc;        /* tuple descriptor of scan */
@@ -137,7 +161,6 @@ typedef struct PgFdwScanState
     List       *retrieved_attrs;    /* list of retrieved attribute numbers */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     unsigned int cursor_number; /* quasi-unique ID for my cursor */
     bool        cursor_exists;    /* have we created the cursor? */
     int            numParams;        /* number of parameters passed to query */
@@ -153,6 +176,12 @@ typedef struct PgFdwScanState
     /* batch-level state, for optimizing rewinds and avoiding useless fetch */
     int            fetch_ct_2;        /* Min(# of fetches done, 2) */
     bool        eof_reached;    /* true if last fetch reached EOF */
+    bool        async;            /* true if run asynchronously */
+    bool        queued;            /* true if this node is in waiter queue */
+    ForeignScanState *waiter;    /* Next node to run a query among nodes
+                                 * sharing the same connection */
+    ForeignScanState *last_waiter;    /* last element in waiter queue.
+                                     * valid only on the leader node */
 
     /* working memory contexts */
     MemoryContext batch_cxt;    /* context holding current batch of tuples */
@@ -166,11 +195,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     char       *p_name;            /* name of prepared statement, if created */
 
     /* extracted fdw_private data */
@@ -197,6 +226,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
@@ -326,6 +356,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
                                             RangeTblEntry *target_rte,
                                             Relation target_relation);
@@ -391,6 +422,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
                                          RelOptInfo *input_rel,
                                          RelOptInfo *output_rel,
                                          void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+                                              WaitEventSet *wes,
+                                              void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -419,7 +454,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
                                       EquivalenceClass *ec, EquivalenceMember *em,
                                       void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static PgFdwModifyState *create_foreign_modify(EState *estate,
                                                RangeTblEntry *rte,
@@ -522,6 +559,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     routine->IterateForeignScan = postgresIterateForeignScan;
     routine->ReScanForeignScan = postgresReScanForeignScan;
     routine->EndForeignScan = postgresEndForeignScan;
+    routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
     /* Functions for updating foreign tables */
     routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -558,6 +596,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     /* Support functions for upper relation push-down */
     routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+    /* Support functions for async execution */
+    routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+    routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
     PG_RETURN_POINTER(routine);
 }
 
@@ -1434,12 +1476,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    fsstate->conn = GetConnection(user, false);
+    fsstate->s.conn = GetConnection(user, false);
+    fsstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
+    fsstate->s.commonstate->leader = NULL;
+    fsstate->s.commonstate->busy = false;
+    fsstate->waiter = NULL;
+    fsstate->last_waiter = node;
 
     /* Assign a unique ID for my cursor */
-    fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+    fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
     fsstate->cursor_exists = false;
 
+    /* Initialize async execution status */
+    fsstate->async = false;
+    fsstate->queued = false;
+
     /* Get private info created by planner functions. */
     fsstate->query = strVal(list_nth(fsplan->fdw_private,
                                      FdwScanPrivateSelectSql));
@@ -1487,40 +1539,241 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
                              &fsstate->param_values);
 }
 
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Enqueue node if it isn't in the queue. Immediately send request it if the
+ * underlying connection is not busy.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+    PgFdwScanState   *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.commonstate->leader;
+
+    /*
+     * Do nothing if the node is already in the queue or already eof'ed.
+     * Note: leader node is not marked as queued.
+     */
+    if (leader == node || fsstate->queued || fsstate->eof_reached)
+        return;
+
+    if (leader == NULL)
+    {
+        /* no leader means not busy, send request immediately */
+        request_more_data(node);
+    }
+    else
+    {
+        /* the connection is busy, queue the node */
+        PgFdwScanState   *leader_state = GetPgFdwScanState(leader);
+        PgFdwScanState   *last_waiter_state
+            = GetPgFdwScanState(leader_state->last_waiter);
+
+        last_waiter_state->waiter = node;
+        leader_state->last_waiter = node;
+        fsstate->queued = true;
+    }
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Make the first waiter be the next leader
+ * Returns the new leader or NULL if there's no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+    PgFdwScanState *leader_state = GetPgFdwScanState(node);
+    ForeignScanState *next_leader = leader_state->waiter;
+
+    Assert(leader_state->s.commonstate->leader = node);
+
+    if (next_leader)
+    {
+        /* the first waiter becomes the next leader */
+        PgFdwScanState *next_leader_state = GetPgFdwScanState(next_leader);
+        next_leader_state->last_waiter = leader_state->last_waiter;
+        next_leader_state->queued = false;
+    }
+
+    leader_state->waiter = NULL;
+    leader_state->s.commonstate->leader = next_leader;
+
+    return next_leader;
+}
+
+/*
+ * Remove the node from waiter queue.
+ *
+ * Remaining results are cleared if the node is a busy leader.
+ * This intended to be used during node shutdown.
+ */
+static inline void
+remove_async_node(ForeignScanState *node)
+{
+    PgFdwScanState        *fsstate = GetPgFdwScanState(node);
+    ForeignScanState    *leader = fsstate->s.commonstate->leader;
+    PgFdwScanState        *leader_state;
+    ForeignScanState    *prev;
+    PgFdwScanState        *prev_state;
+    ForeignScanState    *cur;
+
+    /* no need to remove me */
+    if (!leader || !fsstate->queued)
+        return;
+
+    leader_state = GetPgFdwScanState(leader);
+
+    if (leader == node)
+    {
+        if (leader_state->s.commonstate->busy)
+        {
+            /*
+             * this node is waiting for result, absorb the result first so
+             * that the following commands can be sent on the connection.
+             */
+            PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+            PGconn *conn = leader_state->s.conn;
+
+            while(PQisBusy(conn))
+                PQclear(PQgetResult(conn));
+
+            leader_state->s.commonstate->busy = false;
+        }
+
+        move_to_next_waiter(node);
+
+        return;
+    }
+
+    /*
+     * Just remove the node from the queue
+     *
+     * Nodes don't have a link to the previous node but anyway this function is
+     * called on the shutdown path, so we don't bother seeking for faster way
+     * to do this.
+     */
+    prev = leader;
+    prev_state = leader_state;
+    cur =  GetPgFdwScanState(prev)->waiter;
+    while (cur)
+    {
+        PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+        if (cur == node)
+        {
+            prev_state->waiter = curstate->waiter;
+
+            /* relink to the previous node if the last node was removed */
+            if (leader_state->last_waiter == cur)
+                leader_state->last_waiter = prev;
+
+            fsstate->queued = false;
+
+            return;
+        }
+        prev = cur;
+        prev_state = curstate;
+        cur = curstate->waiter;
+    }
+}
+
 /*
  * postgresIterateForeignScan
- *        Retrieve next row from the result set, or clear tuple slot to indicate
- *        EOF.
+ *        Retrieve next row from the result set.
+ *
+ *        For synchronous nodes, returns clear tuple slot means EOF.
+ *
+ *        For asynchronous nodes, if clear tuple slot is returned, the caller
+ *        needs to check async state to tell if all tuples received
+ *        (AS_AVAILABLE) or waiting for the next data to come (AS_WAITING).
  */
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
-    /*
-     * If this is the first call after Begin or ReScan, we need to create the
-     * cursor on the remote side.
-     */
-    if (!fsstate->cursor_exists)
-        create_cursor(node);
-
-    /*
-     * Get some more tuples, if we've run out.
-     */
+    if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+    {
+        /* we've run out, get some more tuples */
+        if (!node->fs_async)
+        {
+            /*
+             * finish the running query before sending the next command for
+             * this node
+             */
+            if (!fsstate->s.commonstate->busy)
+                vacate_connection((PgFdwState *)fsstate, false);
+
+            request_more_data(node);
+
+            /* Fetch the result immediately. */
+            fetch_received_data(node);
+        }
+        else if (!fsstate->s.commonstate->busy)
+        {
+            /* If the connection is not busy, just send the request. */
+            request_more_data(node);
+        }
+        else
+        {
+            /* The connection is busy, queue the request */
+            bool available = true;
+            ForeignScanState *leader = fsstate->s.commonstate->leader;
+            PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+
+            /* queue the requested node */
+            add_async_waiter(node);
+
+            /*
+             * The request for the next node cannot be sent before the leader
+             * responds. Finish the current leader if possible.
+             */
+            if (PQisBusy(leader_state->s.conn))
+            {
+                int rc = WaitLatchOrSocket(NULL,
+                                           WL_SOCKET_READABLE | WL_TIMEOUT |
+                                           WL_EXIT_ON_PM_DEATH,
+                                           PQsocket(leader_state->s.conn), 0,
+                                           WAIT_EVENT_ASYNC_WAIT);
+                if (!(rc & WL_SOCKET_READABLE))
+                    available = false;
+            }
+
+            /* fetch the leader's data and enqueue it for the next request */
+            if (available)
+            {
+                fetch_received_data(leader);
+                add_async_waiter(leader);
+            }
+        }
+    }
+
     if (fsstate->next_tuple >= fsstate->num_tuples)
     {
-        /* No point in another fetch if we already detected EOF, though. */
-        if (!fsstate->eof_reached)
-            fetch_more_data(node);
-        /* If we didn't get any tuples, must be end of data. */
-        if (fsstate->next_tuple >= fsstate->num_tuples)
-            return ExecClearTuple(slot);
+        /*
+         * We haven't received a result for the given node this time, return
+         * with no tuple to give way to another node.
+         */
+        if (fsstate->eof_reached)
+            node->ss.ps.asyncstate = AS_AVAILABLE;
+        else
+            node->ss.ps.asyncstate = AS_WAITING;
+
+        return ExecClearTuple(slot);
     }
 
     /*
      * Return the next tuple.
      */
+    node->ss.ps.asyncstate = AS_AVAILABLE;
     ExecStoreHeapTuple(fsstate->tuples[fsstate->next_tuple++],
                        slot,
                        false);
@@ -1535,7 +1788,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     char        sql[64];
     PGresult   *res;
 
@@ -1543,6 +1796,8 @@ postgresReScanForeignScan(ForeignScanState *node)
     if (!fsstate->cursor_exists)
         return;
 
+    vacate_connection((PgFdwState *)fsstate, true);
+
     /*
      * If any internal parameters affecting this node have changed, we'd
      * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1571,9 +1826,9 @@ postgresReScanForeignScan(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_exec_query(fsstate->conn, sql);
+    res = pgfdw_exec_query(fsstate->s.conn, sql);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+        pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
     PQclear(res);
 
     /* Now force a fresh FETCH. */
@@ -1591,7 +1846,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
     /* if fsstate is NULL, we are in EXPLAIN; nothing to do */
     if (fsstate == NULL)
@@ -1599,15 +1854,31 @@ postgresEndForeignScan(ForeignScanState *node)
 
     /* Close the cursor if open, to prevent accumulation of cursors */
     if (fsstate->cursor_exists)
-        close_cursor(fsstate->conn, fsstate->cursor_number);
+        close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
     /* Release remote connection */
-    ReleaseConnection(fsstate->conn);
-    fsstate->conn = NULL;
+    ReleaseConnection(fsstate->s.conn);
+    fsstate->s.conn = NULL;
 
     /* MemoryContexts will be deleted automatically. */
 }
 
+/*
+ * postgresShutdownForeignScan
+ *        Remove asynchrony stuff and cleanup garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+    ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+    if (plan->operation != CMD_SELECT)
+        return;
+
+    /* remove the node from waiting queue */
+    remove_async_node(node);
+}
+
 /*
  * postgresAddForeignUpdateTargets
  *        Add resjunk column(s) needed for update/delete on a foreign table
@@ -2372,7 +2643,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    dmstate->conn = GetConnection(user, false);
+    dmstate->s.conn = GetConnection(user, false);
+    dmstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
 
     /* Update the foreign-join-related fields. */
     if (fsplan->scan.scanrelid == 0)
@@ -2457,7 +2730,11 @@ postgresIterateDirectModify(ForeignScanState *node)
      * If this is the first call after Begin, execute the statement.
      */
     if (dmstate->num_tuples == -1)
+    {
+        /* finish running query to send my command */
+        vacate_connection((PgFdwState *)dmstate, true);
         execute_dml_stmt(node);
+    }
 
     /*
      * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2504,8 +2781,8 @@ postgresEndDirectModify(ForeignScanState *node)
         PQclear(dmstate->result);
 
     /* Release remote connection */
-    ReleaseConnection(dmstate->conn);
-    dmstate->conn = NULL;
+    ReleaseConnection(dmstate->s.conn);
+    dmstate->s.conn = NULL;
 
     /* MemoryContext will be deleted automatically. */
 }
@@ -2703,6 +2980,7 @@ estimate_path_cost_size(PlannerInfo *root,
         List       *local_param_join_conds;
         StringInfoData sql;
         PGconn       *conn;
+        PgFdwConnCommonState *commonstate;
         Selectivity local_sel;
         QualCost    local_cost;
         List       *fdw_scan_tlist = NIL;
@@ -2747,6 +3025,18 @@ estimate_path_cost_size(PlannerInfo *root,
 
         /* Get the remote estimate */
         conn = GetConnection(fpinfo->user, false);
+        commonstate = GetConnectionSpecificStorage(fpinfo->user,
+                                                sizeof(PgFdwConnCommonState));
+        if (commonstate)
+        {
+            PgFdwState tmpstate;
+            tmpstate.conn = conn;
+            tmpstate.commonstate = commonstate;
+
+            /* finish running query to send my command */
+            vacate_connection(&tmpstate, true);
+        }
+
         get_remote_estimate(sql.data, conn, &rows, &width,
                             &startup_cost, &total_cost);
         ReleaseConnection(conn);
@@ -3317,11 +3607,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     ExprContext *econtext = node->ss.ps.ps_ExprContext;
     int            numParams = fsstate->numParams;
     const char **values = fsstate->param_values;
-    PGconn       *conn = fsstate->conn;
+    PGconn       *conn = fsstate->s.conn;
     StringInfoData buf;
     PGresult   *res;
 
@@ -3384,50 +3674,119 @@ create_cursor(ForeignScanState *node)
 }
 
 /*
- * Fetch some more rows from the node's cursor.
+ * Sends the next request of the node. If the given node is different from the
+ * current connection leader, pushes it back to waiter queue and let the given
+ * node be the leader.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.commonstate->leader;
+    PGconn       *conn = fsstate->s.conn;
+    char        sql[64];
+
+    /* must be non-busy */
+    Assert(!fsstate->s.commonstate->busy);
+    /* must be not-eof'ed */
+    Assert(!fsstate->eof_reached);
+
+    /*
+     * If this is the first call after Begin or ReScan, we need to create the
+     * cursor on the remote side.
+     */
+    if (!fsstate->cursor_exists)
+        create_cursor(node);
+
+    snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+             fsstate->fetch_size, fsstate->cursor_number);
+
+    if (!PQsendQuery(conn, sql))
+        pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+    fsstate->s.commonstate->busy = true;
+
+    /* The node is the current leader, just return. */
+    if (leader == node)
+        return;
+
+    /* Let the node be the leader */
+    if (leader != NULL)
+    {
+        remove_async_node(node);
+        fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+        fsstate->waiter = leader;
+    }
+    else
+    {
+        fsstate->last_waiter = node;
+        fsstate->waiter = NULL;
+    }
+
+    fsstate->s.commonstate->leader = node;
+}
+
+/*
+ * Fetches received data and automatically send requests of the next waiter.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     PGresult   *volatile res = NULL;
     MemoryContext oldcontext;
+    ForeignScanState *waiter;
+
+    /* I should be the current connection leader */
+    Assert(fsstate->s.commonstate->leader == node);
 
     /*
      * We'll store the tuples in the batch_cxt.  First, flush the previous
-     * batch.
+     * batch if no tuple is remaining
      */
-    fsstate->tuples = NULL;
-    MemoryContextReset(fsstate->batch_cxt);
+    if (fsstate->next_tuple >= fsstate->num_tuples)
+    {
+        fsstate->tuples = NULL;
+        fsstate->num_tuples = 0;
+        MemoryContextReset(fsstate->batch_cxt);
+    }
+    else if (fsstate->next_tuple > 0)
+    {
+        /* There's some remains. Move them to the beginning of the store */
+        int n = 0;
+
+        while(fsstate->next_tuple < fsstate->num_tuples)
+            fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+        fsstate->num_tuples = n;
+    }
+
     oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
     /* PGresult must be released before leaving this function. */
     PG_TRY();
     {
-        PGconn       *conn = fsstate->conn;
-        char        sql[64];
-        int            numrows;
+        PGconn       *conn = fsstate->s.conn;
+        int            addrows;
+        size_t        newsize;
         int            i;
 
-        snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
-                 fsstate->fetch_size, fsstate->cursor_number);
-
-        res = pgfdw_exec_query(conn, sql);
-        /* On error, report the original query, not the FETCH. */
+        res = pgfdw_get_result(conn, fsstate->query);
         if (PQresultStatus(res) != PGRES_TUPLES_OK)
             pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
         /* Convert the data into HeapTuples */
-        numrows = PQntuples(res);
-        fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-        fsstate->num_tuples = numrows;
-        fsstate->next_tuple = 0;
+        addrows = PQntuples(res);
+        newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+        if (fsstate->tuples)
+            fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+        else
+            fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-        for (i = 0; i < numrows; i++)
+        for (i = 0; i < addrows; i++)
         {
             Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-            fsstate->tuples[i] =
+            fsstate->tuples[fsstate->num_tuples + i] =
                 make_tuple_from_result_row(res, i,
                                            fsstate->rel,
                                            fsstate->attinmeta,
@@ -3437,22 +3796,73 @@ fetch_more_data(ForeignScanState *node)
         }
 
         /* Update fetch_ct_2 */
-        if (fsstate->fetch_ct_2 < 2)
+        if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
             fsstate->fetch_ct_2++;
 
+        fsstate->next_tuple = 0;
+        fsstate->num_tuples += addrows;
+
         /* Must be EOF if we didn't get as many tuples as we asked for. */
-        fsstate->eof_reached = (numrows < fsstate->fetch_size);
+        fsstate->eof_reached = (addrows < fsstate->fetch_size);
     }
     PG_FINALLY();
     {
+        fsstate->s.commonstate->busy = false;
+
         if (res)
             PQclear(res);
     }
     PG_END_TRY();
 
+    /* let the first waiter be the next leader of this connection */
+    waiter = move_to_next_waiter(node);
+
+    /* send the next request if any */
+    if (waiter)
+        request_more_data(waiter);
+
     MemoryContextSwitchTo(oldcontext);
 }
 
+/*
+ * Vacate the underlying connection so that this node can send the next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+    PgFdwConnCommonState *commonstate = fdwstate->commonstate;
+    ForeignScanState *leader;
+
+    Assert(commonstate != NULL);
+
+    /* just return if the connection is already available */
+    if (commonstate->leader == NULL || !commonstate->busy)
+        return;
+
+    /*
+     * let the current connection leader read all of the result for the running
+     * query
+     */
+    leader = commonstate->leader;
+    fetch_received_data(leader);
+
+    /* let the first waiter be the next leader of this connection */
+    move_to_next_waiter(leader);
+
+    if (!clear_queue)
+        return;
+
+    /* Clear the waiting list */
+    while (leader)
+    {
+        PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+        fsstate->last_waiter = NULL;
+        leader = fsstate->waiter;
+        fsstate->waiter = NULL;
+    }
+}
+
 /*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
@@ -3566,7 +3976,9 @@ create_foreign_modify(EState *estate,
     user = GetUserMapping(userid, table->serverid);
 
     /* Open connection; report that we'll create a prepared statement. */
-    fmstate->conn = GetConnection(user, true);
+    fmstate->s.conn = GetConnection(user, true);
+    fmstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
     fmstate->p_name = NULL;        /* prepared statement not made yet */
 
     /* Set up remote query information. */
@@ -3653,6 +4065,9 @@ execute_foreign_modify(EState *estate,
            operation == CMD_UPDATE ||
            operation == CMD_DELETE);
 
+    /* finish running query to send my command */
+    vacate_connection((PgFdwState *)fmstate, true);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -3680,14 +4095,14 @@ execute_foreign_modify(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3695,10 +4110,10 @@ execute_foreign_modify(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -3734,7 +4149,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
     /* Construct name we'll use for the prepared statement. */
     snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-             GetPrepStmtNumber(fmstate->conn));
+             GetPrepStmtNumber(fmstate->s.conn));
     p_name = pstrdup(prep_name);
 
     /*
@@ -3744,12 +4159,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * the prepared statements we use in this module are simple enough that
      * the remote server will make the right choices.
      */
-    if (!PQsendPrepare(fmstate->conn,
+    if (!PQsendPrepare(fmstate->s.conn,
                        p_name,
                        fmstate->query,
                        0,
                        NULL))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3757,9 +4172,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
     PQclear(res);
 
     /* This action shows that the prepare has been done. */
@@ -3888,16 +4303,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
          * We don't use a PG_TRY block here, so be careful not to throw error
          * without releasing the PGresult.
          */
-        res = pgfdw_exec_query(fmstate->conn, sql);
+        res = pgfdw_exec_query(fmstate->s.conn, sql);
         if (PQresultStatus(res) != PGRES_COMMAND_OK)
-            pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+            pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
         PQclear(res);
         fmstate->p_name = NULL;
     }
 
     /* Release remote connection */
-    ReleaseConnection(fmstate->conn);
-    fmstate->conn = NULL;
+    ReleaseConnection(fmstate->s.conn);
+    fmstate->s.conn = NULL;
 }
 
 /*
@@ -4056,9 +4471,9 @@ execute_dml_stmt(ForeignScanState *node)
      * the desired result.  This allows us to avoid assuming that the remote
      * server has the same OIDs we do for the parameters' types.
      */
-    if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+    if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
                            NULL, values, NULL, NULL, 0))
-        pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+        pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -4066,10 +4481,10 @@ execute_dml_stmt(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+    dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
     if (PQresultStatus(dmstate->result) !=
         (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+        pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
                            dmstate->query);
 
     /* Get the number of rows affected. */
@@ -5560,6 +5975,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
     /* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+    return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add wait event so that the ForeignScan node is going to wait for.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                                  void *caller_data, bool reinit)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+    /* Reinit is not supported for now. */
+    Assert(reinit);
+
+    if (fsstate->s.commonstate->leader == node)
+    {
+        AddWaitEventToSet(wes,
+                          WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+                          NULL, caller_data);
+        return true;
+    }
+
+    return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..96af75a33e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -85,6 +85,7 @@ typedef struct PgFdwRelationInfo
     UserMapping *user;            /* only set in use_remote_estimate mode */
 
     int            fetch_size;        /* fetch size for this remote table */
+    bool        allow_prefetch;    /* true to allow overlapped fetching  */
 
     /*
      * Name of the relation, for use while EXPLAINing ForeignScan.  It is used
@@ -130,6 +131,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 83971665e3..359208a12a 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1780,25 +1780,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1840,12 +1840,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1904,8 +1904,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.18.2


Re: Asynchronous Append on postgres_fdw nodes.

From
"Andrey V. Lepikhov"
Date:
On 6/10/20 8:05 AM, Kyotaro Horiguchi wrote:
> Hello, Andrey.
> 
> At Tue, 9 Jun 2020 14:20:42 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in
>> On 6/4/20 11:00 AM, Kyotaro Horiguchi wrote:
>> 2. Total cost of an Append node is a sum of the subplans. Maybe in the
>> case of asynchronous append we need to use some reduce factor?
> 
> Yes.  For the reason mentioned above, foreign subpaths don't affect
> the startup cost of Append as far as any sync subpaths exist.  If no
> sync subpaths exist, the Append's startup cost is the minimum startup
> cost among the async subpaths.
I mean that you can possibly change computation of total cost of the 
Async append node. It may affect the planner choice between ForeignScan 
(followed by the execution of the JOIN locally) and partitionwise join 
strategies.

Have you also considered the possibility of dynamic choice between 
synchronous and async append (during optimization)? This may be useful 
for a query with the LIMIT clause.

-- 
Andrey Lepikhov
Postgres Professional



Re: Asynchronous Append on postgres_fdw nodes.

From
"Andrey V. Lepikhov"
Date:
The patch has a problem with partitionwise aggregates.

Asynchronous append do not allow the planner to use partial aggregates. 
Example you can see in attachment. I can't understand why: costs of 
partitionwise join are less.
Initial script and explains of the query with and without the patch you 
can see in attachment.

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com


Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
Thanks for testing, but..

At Mon, 15 Jun 2020 08:51:23 +0500, "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> wrote in 
> The patch has a problem with partitionwise aggregates.
> 
> Asynchronous append do not allow the planner to use partial
> aggregates. Example you can see in attachment. I can't understand why:
> costs of partitionwise join are less.
> Initial script and explains of the query with and without the patch
> you can see in attachment.

I had more or less the same plan with the second one without the patch
(that is, vanilla master/HEAD, but used merge joins instead).

I'm not sure what prevented join pushdown, but the difference between
the two is whether the each partitionwise join is pushed down to
remote or not, That is hardly seems related to the async execution
patch.

Could you tell me how did you get the first plan?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
"Andrey V. Lepikhov"
Date:
On 6/15/20 1:29 PM, Kyotaro Horiguchi wrote:
> Thanks for testing, but..
> 
> At Mon, 15 Jun 2020 08:51:23 +0500, "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> wrote in
>> The patch has a problem with partitionwise aggregates.
>>
>> Asynchronous append do not allow the planner to use partial
>> aggregates. Example you can see in attachment. I can't understand why:
>> costs of partitionwise join are less.
>> Initial script and explains of the query with and without the patch
>> you can see in attachment.
> 
> I had more or less the same plan with the second one without the patch
> (that is, vanilla master/HEAD, but used merge joins instead).
> 
> I'm not sure what prevented join pushdown, but the difference between
> the two is whether the each partitionwise join is pushed down to
> remote or not, That is hardly seems related to the async execution
> patch.
> 
> Could you tell me how did you get the first plan?

1. Use clear current vanilla master.

2. Start two instances with the script 'frgn2n.sh' from attachment.
There are I set GUCs:
enable_partitionwise_join = true
enable_partitionwise_aggregate = true

3. Execute query:
explain analyze SELECT sum(parts.b)
    FROM parts, second
    WHERE parts.a = second.a AND second.b < 100;

That's all.

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
Thanks.

My conclusion on this is the async patch is not the cause of the
behavior change mentioned here.

At Mon, 15 Jun 2020 14:59:18 +0500, "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> wrote in 
> > Could you tell me how did you get the first plan?
> 
> 1. Use clear current vanilla master.
> 
> 2. Start two instances with the script 'frgn2n.sh' from attachment.
> There are I set GUCs:
> enable_partitionwise_join = true
> enable_partitionwise_aggregate = true
> 
> 3. Execute query:
> explain analyze SELECT sum(parts.b)
>     FROM parts, second
>     WHERE parts.a = second.a AND second.b < 100;
> 
> That's all.

With mater/HEAD, I got the second (local join) plan for a while first
then got the first (remote join). The cause of the plan change was
found to be autovacuum on the remote node.

Before the vacuum the result of remote estimation was as follows.

Node2 (remote)
=#  EXPLAIN SELECT r4.b FROM (public.part_1 r4 INNER JOIN public.second_1 r8 ON (((r4.a = r8.a)) AND ((r8.b < 100))));
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Merge Join  (cost=2269.20..3689.70 rows=94449 width=4)
   Merge Cond: (r8.a = r4.a)
   ->  Sort  (cost=74.23..76.11 rows=753 width=4)
         Sort Key: r8.a
         ->  Seq Scan on second_1 r8  (cost=0.00..38.25 rows=753 width=4)
               Filter: (b < 100)
   ->  Sort  (cost=2194.97..2257.68 rows=25086 width=8)
         Sort Key: r4.a
         ->  Seq Scan on part_1 r4  (cost=0.00..361.86 rows=25086 width=8)
(9 rows)

After running a vacuum it changes as follows.

                               QUERY PLAN                               
------------------------------------------------------------------------
 Hash Join  (cost=5.90..776.31 rows=9741 width=4)
   Hash Cond: (r4.a = r8.a)
   ->  Seq Scan on part_1 r4  (cost=0.00..360.78 rows=24978 width=8)
   ->  Hash  (cost=4.93..4.93 rows=78 width=4)
         ->  Seq Scan on second_1 r8  (cost=0.00..4.93 rows=78 width=4)
               Filter: (b < 100)
(6 rows)

That changes the plan on the local side the way you saw.  I saw the
exactly same behavior with the async execution patch.

regards.




FYI, the explain results for another plan changed as follows. It is
estimated to return 25839 rows, which is far less than 94449. So local
join beated remote join.

=# EXPLAIN SELECT a, b FROM public.part_1 ORDER BY a ASC NULLS LAST;
                            QUERY PLAN                            
------------------------------------------------------------------
 Sort  (cost=2194.97..2257.68 rows=25086 width=8)
   Sort Key: a
   ->  Seq Scan on part_1  (cost=0.00..361.86 rows=25086 width=8)
(3 rows)
=# EXPLAIN SELECT a FROM public.second_1 WHERE ((b < 100)) ORDER BY a ASC NULLS LAST;
                           QUERY PLAN                            
-----------------------------------------------------------------
 Sort  (cost=74.23..76.11 rows=753 width=4)
   Sort Key: a
   ->  Seq Scan on second_1  (cost=0.00..38.25 rows=753 width=4)
         Filter: (b < 100)
(4 rows)

Are changed to:

=# EXPLAIN SELECT a, b FROM public.part_1 ORDER BY a ASC NULLS LAST;
                            QUERY PLAN                            
------------------------------------------------------------------
 Sort  (cost=2185.22..2247.66 rows=24978 width=8)
   Sort Key: a
   ->  Seq Scan on part_1  (cost=0.00..360.78 rows=24978 width=8)
(3 rows)

horiguti=# EXPLAIN SELECT a FROM public.second_1 WHERE ((b < 100)) ORDER BY a ASC NULLS LAST;
                          QUERY PLAN                           
---------------------------------------------------------------
 Sort  (cost=7.38..7.57 rows=78 width=4)
   Sort Key: a
   ->  Seq Scan on second_1  (cost=0.00..4.93 rows=78 width=4)
         Filter: (b < 100)
(4 rows)

They return 25056 rows, which is far more than 9741 rows. So remote
join won.

Of course the number of returning rows is not the only factor of the
cost change but is the most significant factor in this case.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
"Andrey V. Lepikhov"
Date:
On 6/16/20 1:30 PM, Kyotaro Horiguchi wrote:
> They return 25056 rows, which is far more than 9741 rows. So remote
> join won.
> 
> Of course the number of returning rows is not the only factor of the
> cost change but is the most significant factor in this case.
> 
Thanks for the attention.
I see one slight flaw of this approach to asynchronous append:
AsyncAppend works only for ForeignScan subplans. if we have 
PartialAggregate, Join or another more complicated subplan, we can't use 
asynchronous machinery.
It may lead to a situation than small difference in a filter constant 
can cause a big difference in execution time.
I imagine an Append node, that can switch current subplan from time to 
time and all ForeignScan nodes of the overall plan are added to one 
queue. The scan buffer can be larger than a cursor fetch size and each 
IterateForeignScan() call can induce asynchronous scan of another 
ForeignScan node if buffer is not full.
But these are only thoughts, not an proposal. I have no questions to 
your patch right now.

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
At Wed, 17 Jun 2020 15:01:08 +0500, "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> wrote in 
> On 6/16/20 1:30 PM, Kyotaro Horiguchi wrote:
> > They return 25056 rows, which is far more than 9741 rows. So remote
> > join won.
> > Of course the number of returning rows is not the only factor of the
> > cost change but is the most significant factor in this case.
> > 
> Thanks for the attention.
> I see one slight flaw of this approach to asynchronous append:
> AsyncAppend works only for ForeignScan subplans. if we have
> PartialAggregate, Join or another more complicated subplan, we can't
> use asynchronous machinery.

Yes, the asynchronous append works only when it has at least one
async-capable immediate subnode. Currently there's only one
async-capable node, ForeignScan.

> I imagine an Append node, that can switch current subplan from time to
> time and all ForeignScan nodes of the overall plan are added to one
> queue. The scan buffer can be larger than a cursor fetch size and each
> IterateForeignScan() call can induce asynchronous scan of another
> ForeignScan node if buffer is not full.
> But these are only thoughts, not an proposal. I have no questions to
> your patch right now.

A major property of async-capable nodes is yieldability(?), that is,
it ought to be able to give way for other nodes when it is not ready
to return a tuple. That means such nodes are state machine rather than
function.  Fortunately ForeignScan is natively a kind of state machine
in a sense so it is easily turned into async-capable node. Append is
also a state machine in the same sense but currently no other nodes
can use it as async-capable node.

For example, an Agg or Sort node generally needs two or more tuples
from its subnode to generate a tuple to be returned to parent.  Some
working memory is needed while generating a returning tuple.  If the
node takes in a tuple from a subnode but not generated a result tuple,
the node must yield CPU time to other nodes. These nodes are not state
machines at all and it is somewhat hard to make it so.  It gets quite
complex in WindowAgg since it calls subnodes at arbitrary call level
of component functions.

Further issue is leaf scan nodes, SeqScan, IndexScan, etc. also need
to be asynchronous.

Finally the executor will turn into push-up style from the current
volcano (pull-style).

I tried all of that (perhaps except scan nodes) a couple of years ago
but the result was a kind of crap^^;

After all, I returned to the current shape.  It doesn't seem bad as
Thomas proposed the same thing.


*1: async-aware is defined (here) as a node that can have
    async-capable subnodes.

> It may lead to a situation than small difference in a filter constant
> can cause a big difference in execution time.

It is what we usually see?  We could get a big win for certain
condition without a loss even otherwise.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
Hello.

As the result of a discussion with Fujita-san off-list, I'm going to
hold off development until he decides whether mine or Thomas' is
better.

However, I fixed two misbehaviors and rebased.

A. It runs ordered Append asynchronously, but that leads to a bogus
  result. I taught create_append_plan not to make subnodes async when
  pathkey is not NIL.

B. It calculated the total cost of Append by summing up total costs of
  all subnodes including async subnodes.  It is too pessimistic so I
  changed that to the following.

    Max(total cost of sync subnodes, maximum cost of async subnodes);

  However this is a bit too optimistic in that it ignores interference
  between async subnodes, it is more realistic in the cases where the
  subnode ForeignScans are connecting to different servers.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 76349549522b1c8ac9bad637cce763718331a066 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH v5 1/3] Allow wait event set to be registered to resource
 owner

WaitEventSet needs to be released using resource owner for a certain
case. This change adds WaitEventSet reowner and allow the creator of a
WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 67 +++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 +++
 6 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 7717bb2719..16aefb03ee 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -218,7 +218,7 @@ pq_init(void)
                 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
     AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
                       NULL, NULL);
     AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 91fa4b619b..10d71b46cb 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -56,6 +56,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -84,6 +85,8 @@ struct WaitEventSet
     int            nevents;        /* number of registered events */
     int            nevents_space;    /* maximum number of events in this set */
 
+    ResourceOwner    resowner;    /* Resource owner */
+
     /*
      * Array, of nevents_space length, storing the definition of events this
      * set is waiting for.
@@ -393,7 +396,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
     int            ret = 0;
     int            rc;
     WaitEvent    event;
-    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
     if (wakeEvents & WL_TIMEOUT)
         Assert(timeout >= 0);
@@ -560,12 +563,15 @@ ResetLatch(Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
     WaitEventSet *set;
     char       *data;
     Size        sz = 0;
 
+    if (res)
+        ResourceOwnerEnlargeWESs(res);
+
     /*
      * Use MAXALIGN size/alignment to guarantee that later uses of memory are
      * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -680,6 +686,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
     StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+    /* Register this wait event set if requested */
+    set->resowner = res;
+    if (res)
+        ResourceOwnerRememberWES(set->resowner, set);
+
     return set;
 }
 
@@ -725,6 +736,9 @@ FreeWaitEventSet(WaitEventSet *set)
     }
 #endif
 
+    if (set->resowner != NULL)
+        ResourceOwnerForgetWES(set->resowner, set);
+
     pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 37b6a4eecd..fcc92138fe 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -70,7 +70,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     {
         WaitEventSet *new_event_set;
 
-        new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+        new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
         AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
                           MyLatch, NULL);
         AddWaitEventToSet(new_event_set, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 8bc2c4e9ea..237ca9fa30 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -128,6 +128,7 @@ typedef struct ResourceOwnerData
     ResourceArray filearr;        /* open temporary files */
     ResourceArray dsmarr;        /* dynamic shmem segments */
     ResourceArray jitarr;        /* JIT contexts */
+    ResourceArray wesarr;        /* wait event sets */
 
     /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
     int            nlocks;            /* number of owned locks */
@@ -175,6 +176,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -444,6 +446,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
     ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
     ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
     ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+    ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
     return owner;
 }
@@ -553,6 +556,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 
             jit_release_context(context);
         }
+
+        /* Ditto for wait event sets */
+        while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+        {
+            WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+            if (isCommit)
+                PrintWESLeakWarning(event);
+            FreeWaitEventSet(event);
+        }
     }
     else if (phase == RESOURCE_RELEASE_LOCKS)
     {
@@ -725,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     Assert(owner->filearr.nitems == 0);
     Assert(owner->dsmarr.nitems == 0);
     Assert(owner->jitarr.nitems == 0);
+    Assert(owner->wesarr.nitems == 0);
     Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
     /*
@@ -752,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     ResourceArrayFree(&(owner->filearr));
     ResourceArrayFree(&(owner->dsmarr));
     ResourceArrayFree(&(owner->jitarr));
+    ResourceArrayFree(&(owner->wesarr));
 
     pfree(owner);
 }
@@ -1370,3 +1385,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
         elog(ERROR, "JIT context %p is not owned by resource owner %s",
              DatumGetPointer(handle), owner->name);
 }
+
+/*
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+    ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+    ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identier of a wait event set,
+     * use its pointer instead.
+     */
+    if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+        elog(ERROR, "wait event set %p is not owned by resource owner %s",
+             events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identier of a wait event set,
+     * use its pointer instead.
+     */
+    elog(WARNING, "wait event set leak: %p still referenced",
+         events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 46ae56cae3..b1b8375768 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -163,7 +164,8 @@ extern void DisownLatch(Latch *latch);
 extern void SetLatch(Latch *latch);
 extern void ResetLatch(Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+                                        ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int    AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
                               Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..7d19dadd57 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
 extern void ResourceOwnerForgetJIT(ResourceOwner owner,
                                    Datum handle);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+                         WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+                       WaitEventSet *);
+
 #endif                            /* RESOWNER_PRIVATE_H */
-- 
2.18.4

From e45b0a7c2a832a2e02411528e95efb4441d7d22d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 15 May 2018 20:21:32 +0900
Subject: [PATCH v5 2/3] Infrastructure for asynchronous execution

This patch add an infrastructure for asynchronous execution. As a PoC
this makes only Append capable to handle asynchronously executable
subnodes.
---
 src/backend/commands/explain.c          |  17 ++
 src/backend/executor/Makefile           |   1 +
 src/backend/executor/execAsync.c        | 152 +++++++++++
 src/backend/executor/nodeAppend.c       | 342 ++++++++++++++++++++----
 src/backend/executor/nodeForeignscan.c  |  21 ++
 src/backend/nodes/bitmapset.c           |  72 +++++
 src/backend/nodes/copyfuncs.c           |   3 +
 src/backend/nodes/outfuncs.c            |   3 +
 src/backend/nodes/readfuncs.c           |   3 +
 src/backend/optimizer/path/allpaths.c   |  24 ++
 src/backend/optimizer/path/costsize.c   |  55 +++-
 src/backend/optimizer/plan/createplan.c |  45 +++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/postmaster/syslogger.c      |   2 +-
 src/backend/utils/adt/ruleutils.c       |   8 +-
 src/backend/utils/resowner/resowner.c   |   4 +-
 src/include/executor/execAsync.h        |  22 ++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 +
 src/include/nodes/bitmapset.h           |   1 +
 src/include/nodes/execnodes.h           |  23 +-
 src/include/nodes/plannodes.h           |   9 +
 src/include/optimizer/paths.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 25 files changed, 757 insertions(+), 73 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 093864cfc0..244676ba11 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -86,6 +86,7 @@ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
                                        List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                                    ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
                           ExplainState *es);
 static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1389,6 +1390,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
         }
         if (plan->parallel_aware)
             appendStringInfoString(es->str, "Parallel ");
+        if (plan->async_capable)
+            appendStringInfoString(es->str, "Async ");
         appendStringInfoString(es->str, pname);
         es->indent++;
     }
@@ -1970,6 +1973,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
         case T_Hash:
             show_hash_info(castNode(HashState, planstate), es);
             break;
+
+        case T_Append:
+            show_append_info(castNode(AppendState, planstate), es);
+            break;
+
         default:
             break;
     }
@@ -2323,6 +2331,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                          ancestors, es);
 }
 
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+    Append *plan = (Append *) astate->ps.plan;
+
+    if (plan->nasyncplans > 0)
+        ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
 /*
  * Show the grouping keys for an Agg node.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
     execAmi.o \
+    execAsync.o \
     execCurrent.o \
     execExpr.o \
     execExprInterp.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..2b7d1877e0
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,152 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *      Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *      src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+/*
+ * ExecAsyncConfigureWait: Add wait event to the WaitEventSet if needed.
+ *
+ * If reinit is true, the caller didn't reuse existing WaitEventSet.
+ */
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                       void *data, bool reinit)
+{
+    switch (nodeTag(node))
+    {
+    case T_ForeignScanState:
+        return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+                                             wes, data, reinit);
+        break;
+    default:
+            elog(ERROR, "unrecognized node type: %d",
+                (int) nodeTag(node));
+    }
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+    int **p_refind;
+    int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+    /* arg is the address of the variable refind in ExecAsyncEventWait */
+    ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+    *mcbarg->p_refind = NULL;
+    *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * ExecAsyncEventWait:
+ *
+ * Wait for async events to fire. Returns the Bitmapset of fired events.
+ */
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+    static int *refind = NULL;
+    static int refindsize = 0;
+    WaitEventSet *wes;
+    WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+    int noccurred = 0;
+    Bitmapset *fired_events = NULL;
+    int i;
+    int n;
+
+    n = bms_num_members(waitnodes);
+    wes = CreateWaitEventSet(TopTransactionContext,
+                             TopTransactionResourceOwner, n);
+    if (refindsize < n)
+    {
+        if (refindsize == 0)
+            refindsize = EVENT_BUFFER_SIZE; /* XXX */
+        while (refindsize < n)
+            refindsize *= 2;
+        if (refind)
+            refind = (int *) repalloc(refind, refindsize * sizeof(int));
+        else
+        {
+            static ExecAsync_mcbarg mcb_arg =
+                { &refind, &refindsize };
+            static MemoryContextCallback mcb =
+                { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+            MemoryContext oldctxt =
+                MemoryContextSwitchTo(TopTransactionContext);
+
+            /*
+             * refind points to a memory block in
+             * TopTransactionContext. Register a callback to reset it.
+             */
+            MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+            refind = (int *) palloc(refindsize * sizeof(int));
+            MemoryContextSwitchTo(oldctxt);
+        }
+    }
+
+    /* Prepare WaitEventSet for waiting on the waitnodes. */
+    n = 0;
+    for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+         i = bms_next_member(waitnodes, i))
+    {
+        refind[i] = i;
+        if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+            n++;
+    }
+
+    /* Return immediately if no node to wait. */
+    if (n == 0)
+    {
+        FreeWaitEventSet(wes);
+        return NULL;
+    }
+
+    noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+                                 EVENT_BUFFER_SIZE,
+                                 WAIT_EVENT_ASYNC_WAIT);
+    FreeWaitEventSet(wes);
+    if (noccurred == 0)
+        return NULL;
+
+    for (i = 0 ; i < noccurred ; i++)
+    {
+        WaitEvent *w = &occurred_event[i];
+
+        if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+        {
+            int n = *(int*)w->user_data;
+
+            fired_events = bms_add_member(fired_events, n);
+        }
+    }
+
+    return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..60c36ee048 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
 #include "executor/execdebug.h"
 #include "executor/execPartition.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -80,6 +81,7 @@ struct ParallelAppendState
 #define INVALID_SUBPLAN_INDEX        -1
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -103,22 +105,22 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     PlanState **appendplanstates;
     Bitmapset  *validsubplans;
     int            nplans;
+    int            nasyncplans;
     int            firstvalid;
     int            i,
                 j;
 
     /* check for unsupported flags */
-    Assert(!(eflags & EXEC_FLAG_MARK));
+    Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
     /*
      * create new AppendState for our append node
      */
     appendstate->ps.plan = (Plan *) node;
     appendstate->ps.state = estate;
-    appendstate->ps.ExecProcNode = ExecAppend;
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+    appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 
     /* If run-time partition pruning is enabled, then set that up now */
     if (node->part_prune_info != NULL)
@@ -152,11 +154,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 
         /*
          * When no run-time pruning is required and there's at least one
-         * subplan, we can fill as_valid_subplans immediately, preventing
+         * subplan, we can fill as_valid_syncsubplans immediately, preventing
          * later calls to ExecFindMatchingSubPlans.
          */
         if (!prunestate->do_exec_prune && nplans > 0)
-            appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
+            appendstate->as_valid_syncsubplans =
+                bms_add_range(NULL, node->nasyncplans, nplans - 1);
     }
     else
     {
@@ -167,8 +170,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
          * subplans as valid; they must also all be initialized.
          */
         Assert(nplans > 0);
-        appendstate->as_valid_subplans = validsubplans =
-            bms_add_range(NULL, 0, nplans - 1);
+        validsubplans =    bms_add_range(NULL, 0, nplans - 1);
+        appendstate->as_valid_syncsubplans =
+            bms_add_range(NULL, node->nasyncplans, nplans - 1);
         appendstate->as_prune_state = NULL;
     }
 
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
      */
     j = 0;
     firstvalid = nplans;
+    nasyncplans = 0;
+
     i = -1;
     while ((i = bms_next_member(validsubplans, i)) >= 0)
     {
         Plan       *initNode = (Plan *) list_nth(node->appendplans, i);
+        int            sub_eflags = eflags;
+
+        /* Let async-capable subplans run asynchronously */
+        if (i < node->nasyncplans)
+        {
+            sub_eflags |= EXEC_FLAG_ASYNC;
+            nasyncplans++;
+        }
 
         /*
          * Record the lowest appendplans index which is a valid partial plan.
@@ -203,13 +217,46 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
         if (i >= node->first_partial_plan && j < firstvalid)
             firstvalid = j;
 
-        appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+        appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
     }
 
     appendstate->as_first_partial_plan = firstvalid;
     appendstate->appendplans = appendplanstates;
     appendstate->as_nplans = nplans;
 
+    /* fill in async stuff */
+    appendstate->as_nasyncplans = nasyncplans;
+    appendstate->as_syncdone = (nasyncplans == nplans);
+    appendstate->as_exec_prune = false;
+
+    /* choose appropriate version of Exec function */
+    if (appendstate->as_nasyncplans == 0)
+        appendstate->ps.ExecProcNode = ExecAppend;
+    else
+        appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+    if (appendstate->as_nasyncplans)
+    {
+        appendstate->as_asyncresult = (TupleTableSlot **)
+            palloc0(appendstate->as_nasyncplans * sizeof(TupleTableSlot *));
+
+        /* initially, all async requests need a request */
+        appendstate->as_needrequest =
+            bms_add_range(NULL, 0, appendstate->as_nasyncplans - 1);
+
+        /*
+         * ExecAppendAsync needs as_valid_syncsubplans to handle async
+         * subnodes.
+         */
+        if (appendstate->as_prune_state != NULL &&
+            appendstate->as_prune_state->do_exec_prune)
+        {
+            Assert(appendstate->as_valid_syncsubplans == NULL);
+
+            appendstate->as_exec_prune = true;
+        }
+    }
+
     /*
      * Miscellaneous initialization
      */
@@ -233,7 +280,7 @@ ExecAppend(PlanState *pstate)
 {
     AppendState *node = castNode(AppendState, pstate);
 
-    if (node->as_whichplan < 0)
+    if (node->as_whichsyncplan < 0)
     {
         /* Nothing to do if there are no subplans */
         if (node->as_nplans == 0)
@@ -243,11 +290,13 @@ ExecAppend(PlanState *pstate)
          * If no subplan has been chosen, we must choose one before
          * proceeding.
          */
-        if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+        if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
             !node->choose_next_subplan(node))
             return ExecClearTuple(node->ps.ps_ResultTupleSlot);
     }
 
+    Assert(node->as_nasyncplans == 0);
+
     for (;;)
     {
         PlanState  *subnode;
@@ -258,8 +307,9 @@ ExecAppend(PlanState *pstate)
         /*
          * figure out which subplan we are currently processing
          */
-        Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-        subnode = node->appendplans[node->as_whichplan];
+        Assert(node->as_whichsyncplan >= 0 &&
+               node->as_whichsyncplan < node->as_nplans);
+        subnode = node->appendplans[node->as_whichsyncplan];
 
         /*
          * get a tuple from the subplan
@@ -282,6 +332,172 @@ ExecAppend(PlanState *pstate)
     }
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+    AppendState *node = castNode(AppendState, pstate);
+    Bitmapset *needrequest;
+    int    i;
+
+    Assert(node->as_nasyncplans > 0);
+
+restart:
+    if (node->as_nasyncresult > 0)
+    {
+        --node->as_nasyncresult;
+        return node->as_asyncresult[node->as_nasyncresult];
+    }
+
+    if (node->as_exec_prune)
+    {
+        Bitmapset *valid_subplans =
+            ExecFindMatchingSubPlans(node->as_prune_state);
+
+        /* Distribute valid subplans into sync and async */
+        node->as_needrequest =
+            bms_intersect(node->as_needrequest, valid_subplans);
+        node->as_valid_syncsubplans =
+            bms_difference(valid_subplans, node->as_needrequest);
+
+        node->as_exec_prune = false;
+    }
+
+    needrequest = node->as_needrequest;
+    node->as_needrequest = NULL;
+    while ((i = bms_first_member(needrequest)) >= 0)
+    {
+        TupleTableSlot *slot;
+        PlanState *subnode = node->appendplans[i];
+
+        slot = ExecProcNode(subnode);
+        if (subnode->asyncstate == AS_AVAILABLE)
+        {
+            if (!TupIsNull(slot))
+            {
+                node->as_asyncresult[node->as_nasyncresult++] = slot;
+                node->as_needrequest = bms_add_member(node->as_needrequest, i);
+            }
+        }
+        else
+            node->as_pending_async = bms_add_member(node->as_pending_async, i);
+    }
+    bms_free(needrequest);
+
+    for (;;)
+    {
+        TupleTableSlot *result;
+
+        /* return now if a result is available */
+        if (node->as_nasyncresult > 0)
+        {
+            --node->as_nasyncresult;
+            return node->as_asyncresult[node->as_nasyncresult];
+        }
+
+        while (!bms_is_empty(node->as_pending_async))
+        {
+            /* Don't wait for async nodes if any sync node exists. */
+            long timeout = node->as_syncdone ? -1 : 0;
+            Bitmapset *fired;
+            int i;
+
+            fired = ExecAsyncEventWait(node->appendplans,
+                                       node->as_pending_async,
+                                       timeout);
+
+            if (bms_is_empty(fired) && node->as_syncdone)
+            {
+                /*
+                 * We come here when all the subnodes had fired before
+                 * waiting. Retry fetching from the nodes.
+                 */
+                node->as_needrequest = node->as_pending_async;
+                node->as_pending_async = NULL;
+                goto restart;
+            }
+
+            while ((i = bms_first_member(fired)) >= 0)
+            {
+                TupleTableSlot *slot;
+                PlanState *subnode = node->appendplans[i];
+                slot = ExecProcNode(subnode);
+
+                Assert(subnode->asyncstate == AS_AVAILABLE);
+
+                if (!TupIsNull(slot))
+                {
+                    node->as_asyncresult[node->as_nasyncresult++] = slot;
+                    node->as_needrequest =
+                        bms_add_member(node->as_needrequest, i);
+                }
+
+                node->as_pending_async =
+                    bms_del_member(node->as_pending_async, i);
+            }
+            bms_free(fired);
+
+            /* return now if a result is available */
+            if (node->as_nasyncresult > 0)
+            {
+                --node->as_nasyncresult;
+                return node->as_asyncresult[node->as_nasyncresult];
+            }
+
+            if (!node->as_syncdone)
+                break;
+        }
+
+        /*
+         * If there is no asynchronous activity still pending and the
+         * synchronous activity is also complete, we're totally done scanning
+         * this node.  Otherwise, we're done with the asynchronous stuff but
+         * must continue scanning the synchronous children.
+         */
+
+        if (!node->as_syncdone &&
+            node->as_whichsyncplan == INVALID_SUBPLAN_INDEX)
+            node->as_syncdone = !node->choose_next_subplan(node);
+
+        if (node->as_syncdone)
+        {
+            Assert(bms_is_empty(node->as_pending_async));
+            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+        }
+
+        /*
+         * get a tuple from the subplan
+         */
+        result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+        if (!TupIsNull(result))
+        {
+            /*
+             * If the subplan gave us something then return it as-is. We do
+             * NOT make use of the result slot that was set up in
+             * ExecInitAppend; there's no need for it.
+             */
+            return result;
+        }
+
+        /*
+         * Go on to the "next" subplan. If no more subplans, return the empty
+         * slot set up for us by ExecInitAppend, unless there are async plans
+         * we have yet to finish.
+         */
+        if (!node->choose_next_subplan(node))
+        {
+            node->as_syncdone = true;
+            if (bms_is_empty(node->as_pending_async))
+            {
+                Assert(bms_is_empty(node->as_needrequest));
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+            }
+        }
+
+        /* Else loop back and try to get a tuple from the new subplan */
+    }
+}
+
 /* ----------------------------------------------------------------
  *        ExecEndAppend
  *
@@ -324,10 +540,18 @@ ExecReScanAppend(AppendState *node)
         bms_overlap(node->ps.chgParam,
                     node->as_prune_state->execparamids))
     {
-        bms_free(node->as_valid_subplans);
-        node->as_valid_subplans = NULL;
+        bms_free(node->as_valid_syncsubplans);
+        node->as_valid_syncsubplans = NULL;
     }
 
+    /* Reset async state. */
+    for (i = 0; i < node->as_nasyncplans; ++i)
+        ExecShutdownNode(node->appendplans[i]);
+
+    node->as_nasyncresult = 0;
+    node->as_needrequest = bms_add_range(NULL, 0, node->as_nasyncplans - 1);
+    node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
     for (i = 0; i < node->as_nplans; i++)
     {
         PlanState  *subnode = node->appendplans[i];
@@ -348,7 +572,7 @@ ExecReScanAppend(AppendState *node)
     }
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    node->as_whichplan = INVALID_SUBPLAN_INDEX;
+    node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 }
 
 /* ----------------------------------------------------------------
@@ -436,7 +660,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-    int            whichplan = node->as_whichplan;
+    int            whichplan = node->as_whichsyncplan;
     int            nextplan;
 
     /* We should never be called when there are no subplans */
@@ -451,10 +675,18 @@ choose_next_subplan_locally(AppendState *node)
      */
     if (whichplan == INVALID_SUBPLAN_INDEX)
     {
-        if (node->as_valid_subplans == NULL)
-            node->as_valid_subplans =
+        /* Shouldn't have an active async node */
+        Assert(bms_is_empty(node->as_needrequest));
+
+        if (node->as_valid_syncsubplans == NULL)
+            node->as_valid_syncsubplans =
                 ExecFindMatchingSubPlans(node->as_prune_state);
 
+        /* Exclude async plans */
+        if (node->as_nasyncplans > 0)
+            bms_del_range(node->as_valid_syncsubplans,
+                          0, node->as_nasyncplans - 1);
+
         whichplan = -1;
     }
 
@@ -462,14 +694,14 @@ choose_next_subplan_locally(AppendState *node)
     Assert(whichplan >= -1 && whichplan <= node->as_nplans);
 
     if (ScanDirectionIsForward(node->ps.state->es_direction))
-        nextplan = bms_next_member(node->as_valid_subplans, whichplan);
+        nextplan = bms_next_member(node->as_valid_syncsubplans, whichplan);
     else
-        nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
+        nextplan = bms_prev_member(node->as_valid_syncsubplans, whichplan);
 
     if (nextplan < 0)
         return false;
 
-    node->as_whichplan = nextplan;
+    node->as_whichsyncplan = nextplan;
 
     return true;
 }
@@ -490,29 +722,29 @@ choose_next_subplan_for_leader(AppendState *node)
     /* Backward scan is not supported by parallel-aware plans */
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
-    /* We should never be called when there are no subplans */
-    Assert(node->as_nplans > 0);
+    /* We should never be called when there are no sync subplans */
+    Assert(node->as_nplans > node->as_nasyncplans);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
     {
         /* Mark just-completed subplan as finished. */
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
     }
     else
     {
         /* Start with last subplan. */
-        node->as_whichplan = node->as_nplans - 1;
+        node->as_whichsyncplan = node->as_nplans - 1;
 
         /*
          * If we've yet to determine the valid subplans then do so now.  If
          * run-time pruning is disabled then the valid subplans will always be
          * set to all subplans.
          */
-        if (node->as_valid_subplans == NULL)
+        if (node->as_valid_syncsubplans == NULL)
         {
-            node->as_valid_subplans =
+            node->as_valid_syncsubplans =
                 ExecFindMatchingSubPlans(node->as_prune_state);
 
             /*
@@ -524,26 +756,26 @@ choose_next_subplan_for_leader(AppendState *node)
     }
 
     /* Loop until we find a subplan to execute. */
-    while (pstate->pa_finished[node->as_whichplan])
+    while (pstate->pa_finished[node->as_whichsyncplan])
     {
-        if (node->as_whichplan == 0)
+        if (node->as_whichsyncplan == 0)
         {
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-            node->as_whichplan = INVALID_SUBPLAN_INDEX;
+            node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
             LWLockRelease(&pstate->pa_lock);
             return false;
         }
 
         /*
-         * We needn't pay attention to as_valid_subplans here as all invalid
+         * We needn't pay attention to as_valid_syncsubplans here as all invalid
          * plans have been marked as finished.
          */
-        node->as_whichplan--;
+        node->as_whichsyncplan--;
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -571,23 +803,23 @@ choose_next_subplan_for_worker(AppendState *node)
     /* Backward scan is not supported by parallel-aware plans */
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
-    /* We should never be called when there are no subplans */
-    Assert(node->as_nplans > 0);
+    /* We should never be called when there are no sync subplans */
+    Assert(node->as_nplans > node->as_nasyncplans);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
     /* Mark just-completed subplan as finished. */
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     /*
      * If we've yet to determine the valid subplans then do so now.  If
      * run-time pruning is disabled then the valid subplans will always be set
      * to all subplans.
      */
-    else if (node->as_valid_subplans == NULL)
+    else if (node->as_valid_syncsubplans == NULL)
     {
-        node->as_valid_subplans =
+        node->as_valid_syncsubplans =
             ExecFindMatchingSubPlans(node->as_prune_state);
         mark_invalid_subplans_as_finished(node);
     }
@@ -600,30 +832,30 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Save the plan from which we are starting the search. */
-    node->as_whichplan = pstate->pa_next_plan;
+    node->as_whichsyncplan = pstate->pa_next_plan;
 
     /* Loop until we find a valid subplan to execute. */
     while (pstate->pa_finished[pstate->pa_next_plan])
     {
         int            nextplan;
 
-        nextplan = bms_next_member(node->as_valid_subplans,
+        nextplan = bms_next_member(node->as_valid_syncsubplans,
                                    pstate->pa_next_plan);
         if (nextplan >= 0)
         {
             /* Advance to the next valid plan. */
             pstate->pa_next_plan = nextplan;
         }
-        else if (node->as_whichplan > node->as_first_partial_plan)
+        else if (node->as_whichsyncplan > node->as_first_partial_plan)
         {
             /*
              * Try looping back to the first valid partial plan, if there is
              * one.  If there isn't, arrange to bail out below.
              */
-            nextplan = bms_next_member(node->as_valid_subplans,
+            nextplan = bms_next_member(node->as_valid_syncsubplans,
                                        node->as_first_partial_plan - 1);
             pstate->pa_next_plan =
-                nextplan < 0 ? node->as_whichplan : nextplan;
+                nextplan < 0 ? node->as_whichsyncplan : nextplan;
         }
         else
         {
@@ -631,10 +863,10 @@ choose_next_subplan_for_worker(AppendState *node)
              * At last plan, and either there are no partial plans or we've
              * tried them all.  Arrange to bail out.
              */
-            pstate->pa_next_plan = node->as_whichplan;
+            pstate->pa_next_plan = node->as_whichsyncplan;
         }
 
-        if (pstate->pa_next_plan == node->as_whichplan)
+        if (pstate->pa_next_plan == node->as_whichsyncplan)
         {
             /* We've tried everything! */
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -644,8 +876,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Pick the plan we found, and advance pa_next_plan one more time. */
-    node->as_whichplan = pstate->pa_next_plan;
-    pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
+    node->as_whichsyncplan = pstate->pa_next_plan;
+    pstate->pa_next_plan = bms_next_member(node->as_valid_syncsubplans,
                                            pstate->pa_next_plan);
 
     /*
@@ -654,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node)
      */
     if (pstate->pa_next_plan < 0)
     {
-        int            nextplan = bms_next_member(node->as_valid_subplans,
+        int            nextplan = bms_next_member(node->as_valid_syncsubplans,
                                                node->as_first_partial_plan - 1);
 
         if (nextplan >= 0)
@@ -671,8 +903,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -699,13 +931,13 @@ mark_invalid_subplans_as_finished(AppendState *node)
     Assert(node->as_prune_state);
 
     /* Nothing to do if all plans are valid */
-    if (bms_num_members(node->as_valid_subplans) == node->as_nplans)
+    if (bms_num_members(node->as_valid_syncsubplans) == node->as_nplans)
         return;
 
     /* Mark all non-valid plans as finished */
     for (i = 0; i < node->as_nplans; i++)
     {
-        if (!bms_is_member(i, node->as_valid_subplans))
+        if (!bms_is_member(i, node->as_valid_syncsubplans))
             node->as_pstate->pa_finished[i] = true;
     }
 }
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 513471ab9b..3bf4aaa63d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -141,6 +141,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
     scanstate->ss.ps.plan = (Plan *) node;
     scanstate->ss.ps.state = estate;
     scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+    scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+    if ((eflags & EXEC_FLAG_ASYNC) != 0)
+        scanstate->fs_async = true;
 
     /*
      * Miscellaneous initialization
@@ -384,3 +388,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
     if (fdwroutine->ShutdownForeignScan)
         fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *        ExecAsyncForeignScanConfigureWait
+ *
+ *        In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                              void *caller_data, bool reinit)
+{
+    FdwRoutine *fdwroutine = node->fdwroutine;
+
+    Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+    return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+                                                 caller_data, reinit);
+}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 2719ea45a3..05b625783b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -895,6 +895,78 @@ bms_add_range(Bitmapset *a, int lower, int upper)
     return a;
 }
 
+/*
+ * bms_del_range
+ *        Delete members in the range of 'lower' to 'upper' from the set.
+ *
+ * Note this could also be done by calling bms_del_member in a loop, however,
+ * using this function will be faster when the range is large as we work at
+ * the bitmapword level rather than at bit level.
+ */
+Bitmapset *
+bms_del_range(Bitmapset *a, int lower, int upper)
+{
+    int            lwordnum,
+                lbitnum,
+                uwordnum,
+                ushiftbits,
+                wordnum;
+
+    if (lower < 0 || upper < 0)
+        elog(ERROR, "negative bitmapset member not allowed");
+    if (lower > upper)
+        elog(ERROR, "lower range must not be above upper range");
+    uwordnum = WORDNUM(upper);
+
+    if (a == NULL)
+    {
+        a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1));
+        a->nwords = uwordnum + 1;
+    }
+
+    /* ensure we have enough words to store the upper bit */
+    else if (uwordnum >= a->nwords)
+    {
+        int            oldnwords = a->nwords;
+        int            i;
+
+        a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1));
+        a->nwords = uwordnum + 1;
+        /* zero out the enlarged portion */
+        for (i = oldnwords; i < a->nwords; i++)
+            a->words[i] = 0;
+    }
+
+    wordnum = lwordnum = WORDNUM(lower);
+
+    lbitnum = BITNUM(lower);
+    ushiftbits = BITNUM(upper) + 1;
+
+    /*
+     * Special case when lwordnum is the same as uwordnum we must perform the
+     * upper and lower masking on the word.
+     */
+    if (lwordnum == uwordnum)
+    {
+        a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1)
+                                | (~(bitmapword) 0) << ushiftbits);
+    }
+    else
+    {
+        /* turn off lbitnum and all bits left of it */
+        a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1);
+
+        /* turn off all bits for any intermediate words */
+        while (wordnum < uwordnum)
+            a->words[wordnum++] = (bitmapword) 0;
+
+        /* turn off upper's bit and all bits right of it. */
+        a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits;
+    }
+
+    return a;
+}
+
 /*
  * bms_int_members - like bms_intersect, but left input is recycled
  */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d8cf87e6d0..89a49e2fdc 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -121,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
     COPY_SCALAR_FIELD(plan_width);
     COPY_SCALAR_FIELD(parallel_aware);
     COPY_SCALAR_FIELD(parallel_safe);
+    COPY_SCALAR_FIELD(async_capable);
     COPY_SCALAR_FIELD(plan_node_id);
     COPY_NODE_FIELD(targetlist);
     COPY_NODE_FIELD(qual);
@@ -246,6 +247,8 @@ _copyAppend(const Append *from)
     COPY_NODE_FIELD(appendplans);
     COPY_SCALAR_FIELD(first_partial_plan);
     COPY_NODE_FIELD(part_prune_info);
+    COPY_SCALAR_FIELD(nasyncplans);
+    COPY_SCALAR_FIELD(referent);
 
     return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515d..d4bb44b268 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -334,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
     WRITE_INT_FIELD(plan_width);
     WRITE_BOOL_FIELD(parallel_aware);
     WRITE_BOOL_FIELD(parallel_safe);
+    WRITE_BOOL_FIELD(async_capable);
     WRITE_INT_FIELD(plan_node_id);
     WRITE_NODE_FIELD(targetlist);
     WRITE_NODE_FIELD(qual);
@@ -436,6 +437,8 @@ _outAppend(StringInfo str, const Append *node)
     WRITE_NODE_FIELD(appendplans);
     WRITE_INT_FIELD(first_partial_plan);
     WRITE_NODE_FIELD(part_prune_info);
+    WRITE_INT_FIELD(nasyncplans);
+    WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab719..63af7c02d8 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1572,6 +1572,7 @@ ReadCommonPlan(Plan *local_node)
     READ_INT_FIELD(plan_width);
     READ_BOOL_FIELD(parallel_aware);
     READ_BOOL_FIELD(parallel_safe);
+    READ_BOOL_FIELD(async_capable);
     READ_INT_FIELD(plan_node_id);
     READ_NODE_FIELD(targetlist);
     READ_NODE_FIELD(qual);
@@ -1672,6 +1673,8 @@ _readAppend(void)
     READ_NODE_FIELD(appendplans);
     READ_INT_FIELD(first_partial_plan);
     READ_NODE_FIELD(part_prune_info);
+    READ_INT_FIELD(nasyncplans);
+    READ_INT_FIELD(referent);
 
     READ_DONE();
 }
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index d984da25d7..bb4c8723bc 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3937,6 +3937,30 @@ generate_partitionwise_join_paths(PlannerInfo *root, RelOptInfo *rel)
     list_free(live_children);
 }
 
+/*
+ * is_projection_capable_path
+ *        Check whether a given Path node is async-capable.
+ */
+bool
+is_async_capable_path(Path *path)
+{
+    switch (nodeTag(path))
+    {
+        case T_ForeignPath:
+            {
+                FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+                Assert(fdwroutine != NULL);
+                if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+                    fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+                    return true;
+            }
+        default:
+            break;
+    }
+    return false;
+}
+
 
 /*****************************************************************************
  *            DEBUG SUPPORT
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4ff3c7a2fd..ccaeb8cc5c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2049,22 +2049,59 @@ cost_append(AppendPath *apath)
 
         if (pathkeys == NIL)
         {
-            Path       *subpath = (Path *) linitial(apath->subpaths);
-
-            /*
-             * For an unordered, non-parallel-aware Append we take the startup
-             * cost as the startup cost of the first subpath.
-             */
-            apath->path.startup_cost = subpath->startup_cost;
+            Cost        first_nonasync_startup_cost = -1.0;
+            Cost        async_min_startup_cost = -1;
+            Cost        async_max_cost = 0.0;
 
             /* Compute rows and costs as sums of subplan rows and costs. */
             foreach(l, apath->subpaths)
             {
                 Path       *subpath = (Path *) lfirst(l);
 
+                /*
+                 * For an unordered, non-parallel-aware Append we take the
+                 * startup cost as the startup cost of the first
+                 * nonasync-capable subpath or the minimum startup cost of
+                 * async-capable subpaths.
+                 */
+                if (!is_async_capable_path(subpath))
+                {
+                    if (first_nonasync_startup_cost < 0.0)
+                        first_nonasync_startup_cost = subpath->startup_cost;
+
+                    apath->path.total_cost += subpath->total_cost;
+                }
+                else
+                {
+                    if (async_min_startup_cost < 0.0 ||
+                        async_min_startup_cost > subpath->startup_cost)
+                        async_min_startup_cost = subpath->startup_cost;
+
+                    /*
+                     * It's not obvious how to determine the total cost of
+                     * async subnodes. Although it is not always true, we
+                     * assume it is the maximum cost among all async subnodes.
+                     */
+                    if (async_max_cost < subpath->total_cost)
+                        async_max_cost = subpath->total_cost;
+                }
+
                 apath->path.rows += subpath->rows;
-                apath->path.total_cost += subpath->total_cost;
             }
+
+            /*
+             * If there's an sync subnodes, the startup cost is the startup
+             * cost of the first sync subnode. Otherwise it's the minimal
+             * startup cost of async subnodes.
+             */
+            if (first_nonasync_startup_cost >= 0.0)
+                apath->path.startup_cost = first_nonasync_startup_cost;
+            else
+                apath->path.startup_cost = async_min_startup_cost;
+
+            /* Use async maximum cost if it exceeds the sync total cost */
+            if (async_max_cost > apath->path.total_cost)
+                apath->path.total_cost = async_max_cost;
         }
         else
         {
@@ -2085,6 +2122,8 @@ cost_append(AppendPath *apath)
              * This case is also different from the above in that we have to
              * account for possibly injecting sorts into subpaths that aren't
              * natively ordered.
+             *
+             * Note: An ordered append won't be run asynchronously.
              */
             foreach(l, apath->subpaths)
             {
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index eb9543f6ad..27ff01f159 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1082,6 +1082,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     bool        tlist_was_changed = false;
     List       *pathkeys = best_path->path.pathkeys;
     List       *subplans = NIL;
+    List       *asyncplans = NIL;
+    List       *syncplans = NIL;
+    List       *asyncpaths = NIL;
+    List       *syncpaths = NIL;
+    List       *newsubpaths = NIL;
     ListCell   *subpaths;
     RelOptInfo *rel = best_path->path.parent;
     PartitionPruneInfo *partpruneinfo = NULL;
@@ -1090,6 +1095,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     Oid           *nodeSortOperators = NULL;
     Oid           *nodeCollations = NULL;
     bool       *nodeNullsFirst = NULL;
+    int            nasyncplans = 0;
+    bool        first = true;
+    bool        referent_is_sync = true;
 
     /*
      * The subpaths list could be empty, if every child was proven empty by
@@ -1219,9 +1227,40 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
             }
         }
 
-        subplans = lappend(subplans, subplan);
+        /*
+         * Classify as async-capable or not. If we have decided to run the
+         * children in parallel, we cannot any one of them run asynchronously.
+         * Planner thinks that all subnodes are executed in order if this
+         * append is orderd. No subpaths cannot be run asynchronously in that
+         * case.
+         */
+        if (pathkeys == NIL &&
+            !best_path->path.parallel_safe && is_async_capable_path(subpath))
+        {
+            subplan->async_capable = true;
+            asyncplans = lappend(asyncplans, subplan);
+            asyncpaths = lappend(asyncpaths, subpath);
+            ++nasyncplans;
+            if (first)
+                referent_is_sync = false;
+        }
+        else
+        {
+            syncplans = lappend(syncplans, subplan);
+            syncpaths = lappend(syncpaths, subpath);
+        }
+
+        first = false;
     }
 
+    /*
+     * subplan contains asyncplans in the first half, if any, and sync plans in
+     * another half, if any.  We need that the same for subpaths to make
+     * partition pruning information in sync with subplans.
+     */
+    subplans = list_concat(asyncplans, syncplans);
+    newsubpaths = list_concat(asyncpaths, syncpaths);
+
     /*
      * If any quals exist, they may be useful to perform further partition
      * pruning during execution.  Gather information needed by the executor to
@@ -1249,7 +1288,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
         if (prunequal != NIL)
             partpruneinfo =
                 make_partition_pruneinfo(root, rel,
-                                         best_path->subpaths,
+                                         newsubpaths,
                                          best_path->partitioned_rels,
                                          prunequal);
     }
@@ -1257,6 +1296,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     plan->appendplans = subplans;
     plan->first_partial_plan = best_path->first_partial_path;
     plan->part_prune_info = partpruneinfo;
+    plan->nasyncplans = nasyncplans;
+    plan->referent = referent_is_sync ? nasyncplans : 0;
 
     copy_generic_path_info(&plan->plan, (Path *) best_path);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c022597bc0..4db86252c9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3878,6 +3878,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_XACT_GROUP_UPDATE:
             event_name = "XactGroupUpdate";
             break;
+        case WAIT_EVENT_ASYNC_WAIT:
+            event_name = "AsyncExecWait";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index ffcb54968f..a4de6d90e2 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -300,7 +300,7 @@ SysLoggerMain(int argc, char *argv[])
      * syslog pipe, which implies that all other backends have exited
      * (including the postmaster).
      */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 2);
+    wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 2);
     AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
 #ifndef WIN32
     AddWaitEventToSet(wes, WL_SOCKET_READABLE, syslogPipe[0], NULL, NULL);
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 2cbcb4b85e..46a4b0696f 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4574,10 +4574,14 @@ set_deparse_plan(deparse_namespace *dpns, Plan *plan)
      * tlists according to one of the children, and the first one is the most
      * natural choice.  Likewise special-case ModifyTable to pretend that the
      * first child plan is the OUTER referent; this is to support RETURNING
-     * lists containing references to non-target relations.
+     * lists containing references to non-target relations. For Append, use the
+     * explicitly specified referent.
      */
     if (IsA(plan, Append))
-        dpns->outer_plan = linitial(((Append *) plan)->appendplans);
+    {
+        Append *app = (Append *) plan;
+        dpns->outer_plan = list_nth(app->appendplans, app->referent);
+    }
     else if (IsA(plan, MergeAppend))
         dpns->outer_plan = linitial(((MergeAppend *) plan)->mergeplans);
     else if (IsA(plan, ModifyTable))
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 237ca9fa30..27742a1641 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -1416,7 +1416,7 @@ void
 ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
 {
     /*
-     * XXXX: There's no property to show as an identier of a wait event set,
+     * XXXX: There's no property to show as an identifier of a wait event set,
      * use its pointer instead.
      */
     if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
@@ -1431,7 +1431,7 @@ static void
 PrintWESLeakWarning(WaitEventSet *events)
 {
     /*
-     * XXXX: There's no property to show as an identier of a wait event set,
+     * XXXX: There's no property to show as an identifier of a wait event set,
      * use its pointer instead.
      */
     elog(WARNING, "wait event set leak: %p still referenced",
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..3b6bf4a516
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,22 @@
+/*--------------------------------------------------------------------
+ * execAsync.c
+ *        Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *        src/backend/executor/execAsync.c
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                                   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+                                     long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c7deeac662..aca9e2bddd 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -59,6 +59,7 @@
 #define EXEC_FLAG_MARK            0x0008    /* need mark/restore */
 #define EXEC_FLAG_SKIP_TRIGGERS 0x0010    /* skip AfterTrigger calls */
 #define EXEC_FLAG_WITH_NO_DATA    0x0020    /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC            0x0040    /* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..71a233b41f 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
                                             ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+                                          WaitEventSet *wes,
+                                          void *caller_data, bool reinit);
 
 #endif                            /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..853ba2b5ad 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -169,6 +169,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
                                                             List *fdw_private,
                                                             RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+                                                    WaitEventSet *wes,
+                                                    void *caller_data,
+                                                    bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -190,6 +195,7 @@ typedef struct FdwRoutine
     GetForeignPlan_function GetForeignPlan;
     BeginForeignScan_function BeginForeignScan;
     IterateForeignScan_function IterateForeignScan;
+    IterateForeignScan_function IterateForeignScanAsync;
     ReScanForeignScan_function ReScanForeignScan;
     EndForeignScan_function EndForeignScan;
 
@@ -242,6 +248,11 @@ typedef struct FdwRoutine
     InitializeDSMForeignScan_function InitializeDSMForeignScan;
     ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
     InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+    /* Support functions for asynchronous execution */
+    IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+    ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
     ShutdownForeignScan_function ShutdownForeignScan;
 
     /* Support functions for path reparameterization. */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index d113c271ee..177e6218cb 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -107,6 +107,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
+extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
 
 /* support for iterating through the integer elements of a set: */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f5dfa32d55..8e230ee5c3 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -938,6 +938,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+    AS_AVAILABLE,
+    AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
     NodeTag        type;
@@ -1026,6 +1032,11 @@ typedef struct PlanState
     bool        outeropsset;
     bool        inneropsset;
     bool        resultopsset;
+
+    /* Async subnode execution stuff */
+    AsyncState    asyncstate;
+
+    int32        padding;            /* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1221,14 +1232,21 @@ struct AppendState
     PlanState    ps;                /* its first field is NodeTag */
     PlanState **appendplans;    /* array of PlanStates for my inputs */
     int            as_nplans;
-    int            as_whichplan;
+    int            as_whichsyncplan; /* which sync plan is being executed  */
     int            as_first_partial_plan;    /* Index of 'appendplans' containing
                                          * the first partial plan */
+    int            as_nasyncplans;    /* # of async-capable children */
     ParallelAppendState *as_pstate; /* parallel coordination info */
     Size        pstate_len;        /* size of parallel coordination info */
     struct PartitionPruneState *as_prune_state;
-    Bitmapset  *as_valid_subplans;
+    Bitmapset  *as_valid_syncsubplans;
     bool        (*choose_next_subplan) (AppendState *);
+    bool        as_syncdone;    /* all synchronous plans done? */
+    Bitmapset  *as_needrequest;    /* async plans needing a new request */
+    Bitmapset  *as_pending_async;    /* pending async plans */
+    TupleTableSlot **as_asyncresult;    /* results of each async plan */
+    int            as_nasyncresult;    /* # of valid entries in as_asyncresult */
+    bool        as_exec_prune;    /* runtime pruning needed for async exec? */
 };
 
 /* ----------------
@@ -1796,6 +1814,7 @@ typedef struct ForeignScanState
     Size        pscan_len;        /* size of parallel coordination information */
     /* use struct pointer to avoid including fdwapi.h here */
     struct FdwRoutine *fdwroutine;
+    bool        fs_async;
     void       *fdw_state;        /* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 83e01074ed..abad89b327 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -135,6 +135,11 @@ typedef struct Plan
     bool        parallel_aware; /* engage parallel-aware logic? */
     bool        parallel_safe;    /* OK to use as part of parallel plan? */
 
+    /*
+     * information needed for asynchronous execution
+     */
+    bool        async_capable;  /* engage asynchronous execution logic? */
+
     /*
      * Common structural data for all Plan types.
      */
@@ -262,6 +267,10 @@ typedef struct Append
 
     /* Info for run-time subplan pruning; NULL if we're not doing that */
     struct PartitionPruneInfo *part_prune_info;
+
+    /* Async child node execution stuff */
+    int            nasyncplans;    /* # async subplans, always at start of list */
+    int            referent;         /* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 10b6e81079..53876b2d8b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -241,4 +241,6 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
 extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
                                     List *live_childrels);
 
+extern bool is_async_capable_path(Path *path);
+
 #endif                            /* PATHS_H */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..c0ea7f5aa4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -887,7 +887,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
     WAIT_EVENT_SYNC_REP,
-    WAIT_EVENT_XACT_GROUP_UPDATE
+    WAIT_EVENT_XACT_GROUP_UPDATE,
+    WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.18.4

From 37cf695f5f019bcfd6554473a50d4ff6f7b44462 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH v5 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c             |  28 +
 .../postgres_fdw/expected/postgres_fdw.out    | 272 ++++----
 contrib/postgres_fdw/postgres_fdw.c           | 601 +++++++++++++++---
 contrib/postgres_fdw/postgres_fdw.h           |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql     |  20 +-
 5 files changed, 710 insertions(+), 213 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 52d1fe3563..d9edc5e4de 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
     bool        invalidated;    /* true if reconnect is pending */
     uint32        server_hashvalue;    /* hash value of foreign server OID */
     uint32        mapping_hashvalue;    /* hash value of user mapping OID */
+    void        *storage;        /* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
         elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
              entry->conn, server->servername, user->umid, user->userid);
+        entry->storage = NULL;
     }
 
     /*
@@ -215,6 +217,32 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
     return entry->conn;
 }
 
+/*
+ * Returns the connection specific storage for this user. Allocate with
+ * initsize if not exists.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+    bool        found;
+    ConnCacheEntry *entry;
+    ConnCacheKey key;
+
+    /* Find storage using the same key with GetConnection */
+    key = user->umid;
+    entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+    Assert(found);
+
+    /* Create one if not yet. */
+    if (entry->storage == NULL)
+    {
+        entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+        memset(entry->storage, 0, initsize);
+    }
+
+    return entry->storage;
+}
+
 /*
  * Connect to remote server using specified server and user mapping properties.
  */
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 82fc1290ef..bf9b4041cd 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6973,7 +6973,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -7001,7 +7001,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -7029,7 +7029,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -7057,7 +7057,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -7127,35 +7127,41 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar bar_1
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+                     Sort Key: bar_1.f1
+                     ->  Seq Scan on public.bar bar_1
+                           Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
                ->  Foreign Scan on public.bar2 bar_2
                      Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -7165,35 +7171,41 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar bar_1
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+                     Sort Key: bar_1.f1
+                     ->  Seq Scan on public.bar bar_1
+                           Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
                ->  Foreign Scan on public.bar2 bar_2
                      Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -7223,11 +7235,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
    ->  Hash Join
          Output: bar_1.f1, (bar_1.f2 + 100), bar_1.f3, bar_1.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -7241,12 +7254,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(41 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
 select tableoid::regclass, * from bar order by 1,2;
@@ -7276,16 +7290,17 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
+               Async subplans: 2 
+               ->  Async Foreign Scan on public.foo2 foo_1
+                     Output: ROW(foo_1.f1), foo_1.f1
+                     Remote SQL: SELECT f1 FROM public.loct1
+               ->  Async Foreign Scan on public.foo2 foo_3
+                     Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+                     Remote SQL: SELECT f1 FROM public.loct1
                ->  Seq Scan on public.foo
                      Output: ROW(foo.f1), foo.f1
-               ->  Foreign Scan on public.foo2 foo_1
-                     Output: ROW(foo_1.f1), foo_1.f1
-                     Remote SQL: SELECT f1 FROM public.loct1
                ->  Seq Scan on public.foo foo_2
                      Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
-               ->  Foreign Scan on public.foo2 foo_3
-                     Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
-                     Remote SQL: SELECT f1 FROM public.loct1
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -7303,17 +7318,18 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on public.foo2 foo_1
+                           Output: ROW(foo_1.f1), foo_1.f1
+                           Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Async Foreign Scan on public.foo2 foo_3
+                           Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+                           Remote SQL: SELECT f1 FROM public.loct1
                      ->  Seq Scan on public.foo
                            Output: ROW(foo.f1), foo.f1
-                     ->  Foreign Scan on public.foo2 foo_1
-                           Output: ROW(foo_1.f1), foo_1.f1
-                           Remote SQL: SELECT f1 FROM public.loct1
                      ->  Seq Scan on public.foo foo_2
                            Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
-                     ->  Foreign Scan on public.foo2 foo_3
-                           Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
-                           Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+(47 rows)
 
 update bar set f2 = f2 + 100
 from
@@ -7463,27 +7479,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2 bar_1
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2 bar_1
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2 bar_1
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2 bar_1
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
@@ -8558,11 +8580,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
  Sort
    Sort Key: t1.a, t3.c
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE
t1.a% 25 =0 ORDER BY 1,2,3;
 
   a  |  b  |  c   
@@ -8597,20 +8620,22 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
 -- with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2
t2WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
 
-                       QUERY PLAN                       
---------------------------------------------------------
+                          QUERY PLAN                          
+--------------------------------------------------------------
  Sort
    Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
    ->  Hash Full Join
          Hash Cond: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
-               ->  Foreign Scan on ftprt1_p2 t1_2
+               Async subplans: 2 
+               ->  Async Foreign Scan on ftprt1_p1 t1_1
+               ->  Async Foreign Scan on ftprt1_p2 t1_2
          ->  Hash
                ->  Append
-                     ->  Foreign Scan on ftprt2_p1 t2_1
-                     ->  Foreign Scan on ftprt2_p2 t2_2
-(11 rows)
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on ftprt2_p1 t2_1
+                     ->  Async Foreign Scan on ftprt2_p2 t2_2
+(13 rows)
 
 SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2
t2WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
 
        wr       |       wr       
@@ -8639,11 +8664,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
  Sort
    Sort Key: t1.a, t1.b
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE
t1.a%25= 0 ORDER BY 1,2;
 
   a  |  b  
@@ -8696,21 +8722,23 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
 -- test FOR UPDATE; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF
t1;
-                          QUERY PLAN                          
---------------------------------------------------------------
+                             QUERY PLAN                             
+--------------------------------------------------------------------
  LockRows
    ->  Sort
          Sort Key: t1.a
          ->  Hash Join
                Hash Cond: (t2.b = t1.a)
                ->  Append
-                     ->  Foreign Scan on ftprt2_p1 t2_1
-                     ->  Foreign Scan on ftprt2_p2 t2_2
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on ftprt2_p1 t2_1
+                     ->  Async Foreign Scan on ftprt2_p2 t2_2
                ->  Hash
                      ->  Append
-                           ->  Foreign Scan on ftprt1_p1 t1_1
-                           ->  Foreign Scan on ftprt1_p2 t1_2
-(12 rows)
+                           Async subplans: 2 
+                           ->  Async Foreign Scan on ftprt1_p1 t1_1
+                           ->  Async Foreign Scan on ftprt1_p2 t1_2
+(14 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF
t1;
   a  |  b  
@@ -8745,18 +8773,19 @@ ANALYZE fpagg_tab_p3;
 SET enable_partitionwise_aggregate TO false;
 EXPLAIN (COSTS OFF)
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
-                        QUERY PLAN                         
------------------------------------------------------------
+                           QUERY PLAN                            
+-----------------------------------------------------------------
  Sort
    Sort Key: pagg_tab.a
    ->  HashAggregate
          Group Key: pagg_tab.a
          Filter: (avg(pagg_tab.b) < '22'::numeric)
          ->  Append
-               ->  Foreign Scan on fpagg_tab_p1 pagg_tab_1
-               ->  Foreign Scan on fpagg_tab_p2 pagg_tab_2
-               ->  Foreign Scan on fpagg_tab_p3 pagg_tab_3
-(9 rows)
+               Async subplans: 3 
+               ->  Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+               ->  Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+               ->  Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
 
 -- Plan with partitionwise aggregates is enabled
 SET enable_partitionwise_aggregate TO true;
@@ -8767,13 +8796,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
  Sort
    Sort Key: pagg_tab.a
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 3 
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
-(9 rows)
+(10 rows)
 
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | sum  | min | count 
@@ -8795,29 +8825,22 @@ SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  Sort
    Output: t1.a, (count(((t1.*)::pagg_tab)))
    Sort Key: t1.a
-   ->  Append
-         ->  HashAggregate
-               Output: t1.a, count(((t1.*)::pagg_tab))
-               Group Key: t1.a
-               Filter: (avg(t1.b) < '22'::numeric)
-               ->  Foreign Scan on public.fpagg_tab_p1 t1
-                     Output: t1.a, t1.*, t1.b
-                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
-         ->  HashAggregate
-               Output: t1_1.a, count(((t1_1.*)::pagg_tab))
-               Group Key: t1_1.a
-               Filter: (avg(t1_1.b) < '22'::numeric)
-               ->  Foreign Scan on public.fpagg_tab_p2 t1_1
+   ->  HashAggregate
+         Output: t1.a, count(((t1.*)::pagg_tab))
+         Group Key: t1.a
+         Filter: (avg(t1.b) < '22'::numeric)
+         ->  Append
+               Async subplans: 3 
+               ->  Async Foreign Scan on public.fpagg_tab_p1 t1_1
                      Output: t1_1.a, t1_1.*, t1_1.b
-                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
-         ->  HashAggregate
-               Output: t1_2.a, count(((t1_2.*)::pagg_tab))
-               Group Key: t1_2.a
-               Filter: (avg(t1_2.b) < '22'::numeric)
-               ->  Foreign Scan on public.fpagg_tab_p3 t1_2
+                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
+               ->  Async Foreign Scan on public.fpagg_tab_p2 t1_2
                      Output: t1_2.a, t1_2.*, t1_2.b
+                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
+               ->  Async Foreign Scan on public.fpagg_tab_p3 t1_3
+                     Output: t1_3.a, t1_3.*, t1_3.b
                      Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
-(25 rows)
+(18 rows)
 
 SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | count 
@@ -8837,20 +8860,15 @@ SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700
 -----------------------------------------------------------------
  Sort
    Sort Key: pagg_tab.b
-   ->  Finalize HashAggregate
+   ->  HashAggregate
          Group Key: pagg_tab.b
          Filter: (sum(pagg_tab.a) < 700)
          ->  Append
-               ->  Partial HashAggregate
-                     Group Key: pagg_tab.b
-                     ->  Foreign Scan on fpagg_tab_p1 pagg_tab
-               ->  Partial HashAggregate
-                     Group Key: pagg_tab_1.b
-                     ->  Foreign Scan on fpagg_tab_p2 pagg_tab_1
-               ->  Partial HashAggregate
-                     Group Key: pagg_tab_2.b
-                     ->  Foreign Scan on fpagg_tab_p3 pagg_tab_2
-(15 rows)
+               Async subplans: 3 
+               ->  Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+               ->  Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+               ->  Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
 
 -- ===================================================================
 -- access rights and superuser
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 9fc53cad68..4bfc2d39ea 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -35,6 +37,7 @@
 #include "optimizer/restrictinfo.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "postgres_fdw.h"
 #include "utils/builtins.h"
 #include "utils/float.h"
@@ -56,6 +59,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -122,11 +128,29 @@ enum FdwDirectModifyPrivateIndex
     FdwDirectModifyPrivateSetProcessed
 };
 
+/*
+ * Connection common state - shared among all PgFdwState instances using the
+ * same connection.
+ */
+typedef struct PgFdwConnCommonState
+{
+    ForeignScanState   *leader;        /* leader node of this connection */
+    bool                busy;        /* true if this connection is busy */
+} PgFdwConnCommonState;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+    PGconn       *conn;                    /* connection for the scan */
+    PgFdwConnCommonState *commonstate;    /* connection common state */
+} PgFdwState;
+
 /*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table. NULL
                                  * for a foreign join scan. */
     TupleDesc    tupdesc;        /* tuple descriptor of scan */
@@ -137,7 +161,6 @@ typedef struct PgFdwScanState
     List       *retrieved_attrs;    /* list of retrieved attribute numbers */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     unsigned int cursor_number; /* quasi-unique ID for my cursor */
     bool        cursor_exists;    /* have we created the cursor? */
     int            numParams;        /* number of parameters passed to query */
@@ -153,6 +176,12 @@ typedef struct PgFdwScanState
     /* batch-level state, for optimizing rewinds and avoiding useless fetch */
     int            fetch_ct_2;        /* Min(# of fetches done, 2) */
     bool        eof_reached;    /* true if last fetch reached EOF */
+    bool        async;            /* true if run asynchronously */
+    bool        queued;            /* true if this node is in waiter queue */
+    ForeignScanState *waiter;    /* Next node to run a query among nodes
+                                 * sharing the same connection */
+    ForeignScanState *last_waiter;    /* last element in waiter queue.
+                                     * valid only on the leader node */
 
     /* working memory contexts */
     MemoryContext batch_cxt;    /* context holding current batch of tuples */
@@ -166,11 +195,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     char       *p_name;            /* name of prepared statement, if created */
 
     /* extracted fdw_private data */
@@ -197,6 +226,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
@@ -326,6 +356,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
                                             RangeTblEntry *target_rte,
                                             Relation target_relation);
@@ -391,6 +422,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
                                          RelOptInfo *input_rel,
                                          RelOptInfo *output_rel,
                                          void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+                                              WaitEventSet *wes,
+                                              void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -419,7 +454,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
                                       EquivalenceClass *ec, EquivalenceMember *em,
                                       void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static PgFdwModifyState *create_foreign_modify(EState *estate,
                                                RangeTblEntry *rte,
@@ -522,6 +559,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     routine->IterateForeignScan = postgresIterateForeignScan;
     routine->ReScanForeignScan = postgresReScanForeignScan;
     routine->EndForeignScan = postgresEndForeignScan;
+    routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
     /* Functions for updating foreign tables */
     routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -558,6 +596,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     /* Support functions for upper relation push-down */
     routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+    /* Support functions for async execution */
+    routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+    routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
     PG_RETURN_POINTER(routine);
 }
 
@@ -1434,12 +1476,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    fsstate->conn = GetConnection(user, false);
+    fsstate->s.conn = GetConnection(user, false);
+    fsstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
+    fsstate->s.commonstate->leader = NULL;
+    fsstate->s.commonstate->busy = false;
+    fsstate->waiter = NULL;
+    fsstate->last_waiter = node;
 
     /* Assign a unique ID for my cursor */
-    fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+    fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
     fsstate->cursor_exists = false;
 
+    /* Initialize async execution status */
+    fsstate->async = false;
+    fsstate->queued = false;
+
     /* Get private info created by planner functions. */
     fsstate->query = strVal(list_nth(fsplan->fdw_private,
                                      FdwScanPrivateSelectSql));
@@ -1487,40 +1539,241 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
                              &fsstate->param_values);
 }
 
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Enqueue node if it isn't in the queue. Immediately send request it if the
+ * underlying connection is not busy.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+    PgFdwScanState   *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.commonstate->leader;
+
+    /*
+     * Do nothing if the node is already in the queue or already eof'ed.
+     * Note: leader node is not marked as queued.
+     */
+    if (leader == node || fsstate->queued || fsstate->eof_reached)
+        return;
+
+    if (leader == NULL)
+    {
+        /* no leader means not busy, send request immediately */
+        request_more_data(node);
+    }
+    else
+    {
+        /* the connection is busy, queue the node */
+        PgFdwScanState   *leader_state = GetPgFdwScanState(leader);
+        PgFdwScanState   *last_waiter_state
+            = GetPgFdwScanState(leader_state->last_waiter);
+
+        last_waiter_state->waiter = node;
+        leader_state->last_waiter = node;
+        fsstate->queued = true;
+    }
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Make the first waiter be the next leader
+ * Returns the new leader or NULL if there's no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+    PgFdwScanState *leader_state = GetPgFdwScanState(node);
+    ForeignScanState *next_leader = leader_state->waiter;
+
+    Assert(leader_state->s.commonstate->leader = node);
+
+    if (next_leader)
+    {
+        /* the first waiter becomes the next leader */
+        PgFdwScanState *next_leader_state = GetPgFdwScanState(next_leader);
+        next_leader_state->last_waiter = leader_state->last_waiter;
+        next_leader_state->queued = false;
+    }
+
+    leader_state->waiter = NULL;
+    leader_state->s.commonstate->leader = next_leader;
+
+    return next_leader;
+}
+
+/*
+ * Remove the node from waiter queue.
+ *
+ * Remaining results are cleared if the node is a busy leader.
+ * This intended to be used during node shutdown.
+ */
+static inline void
+remove_async_node(ForeignScanState *node)
+{
+    PgFdwScanState        *fsstate = GetPgFdwScanState(node);
+    ForeignScanState    *leader = fsstate->s.commonstate->leader;
+    PgFdwScanState        *leader_state;
+    ForeignScanState    *prev;
+    PgFdwScanState        *prev_state;
+    ForeignScanState    *cur;
+
+    /* no need to remove me */
+    if (!leader || !fsstate->queued)
+        return;
+
+    leader_state = GetPgFdwScanState(leader);
+
+    if (leader == node)
+    {
+        if (leader_state->s.commonstate->busy)
+        {
+            /*
+             * this node is waiting for result, absorb the result first so
+             * that the following commands can be sent on the connection.
+             */
+            PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+            PGconn *conn = leader_state->s.conn;
+
+            while(PQisBusy(conn))
+                PQclear(PQgetResult(conn));
+
+            leader_state->s.commonstate->busy = false;
+        }
+
+        move_to_next_waiter(node);
+
+        return;
+    }
+
+    /*
+     * Just remove the node from the queue
+     *
+     * Nodes don't have a link to the previous node but anyway this function is
+     * called on the shutdown path, so we don't bother seeking for faster way
+     * to do this.
+     */
+    prev = leader;
+    prev_state = leader_state;
+    cur =  GetPgFdwScanState(prev)->waiter;
+    while (cur)
+    {
+        PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+        if (cur == node)
+        {
+            prev_state->waiter = curstate->waiter;
+
+            /* relink to the previous node if the last node was removed */
+            if (leader_state->last_waiter == cur)
+                leader_state->last_waiter = prev;
+
+            fsstate->queued = false;
+
+            return;
+        }
+        prev = cur;
+        prev_state = curstate;
+        cur = curstate->waiter;
+    }
+}
+
 /*
  * postgresIterateForeignScan
- *        Retrieve next row from the result set, or clear tuple slot to indicate
- *        EOF.
+ *        Retrieve next row from the result set.
+ *
+ *        For synchronous nodes, returns clear tuple slot means EOF.
+ *
+ *        For asynchronous nodes, if clear tuple slot is returned, the caller
+ *        needs to check async state to tell if all tuples received
+ *        (AS_AVAILABLE) or waiting for the next data to come (AS_WAITING).
  */
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
-    /*
-     * If this is the first call after Begin or ReScan, we need to create the
-     * cursor on the remote side.
-     */
-    if (!fsstate->cursor_exists)
-        create_cursor(node);
-
-    /*
-     * Get some more tuples, if we've run out.
-     */
+    if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+    {
+        /* we've run out, get some more tuples */
+        if (!node->fs_async)
+        {
+            /*
+             * finish the running query before sending the next command for
+             * this node
+             */
+            if (!fsstate->s.commonstate->busy)
+                vacate_connection((PgFdwState *)fsstate, false);
+
+            request_more_data(node);
+
+            /* Fetch the result immediately. */
+            fetch_received_data(node);
+        }
+        else if (!fsstate->s.commonstate->busy)
+        {
+            /* If the connection is not busy, just send the request. */
+            request_more_data(node);
+        }
+        else
+        {
+            /* The connection is busy, queue the request */
+            bool available = true;
+            ForeignScanState *leader = fsstate->s.commonstate->leader;
+            PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+
+            /* queue the requested node */
+            add_async_waiter(node);
+
+            /*
+             * The request for the next node cannot be sent before the leader
+             * responds. Finish the current leader if possible.
+             */
+            if (PQisBusy(leader_state->s.conn))
+            {
+                int rc = WaitLatchOrSocket(NULL,
+                                           WL_SOCKET_READABLE | WL_TIMEOUT |
+                                           WL_EXIT_ON_PM_DEATH,
+                                           PQsocket(leader_state->s.conn), 0,
+                                           WAIT_EVENT_ASYNC_WAIT);
+                if (!(rc & WL_SOCKET_READABLE))
+                    available = false;
+            }
+
+            /* fetch the leader's data and enqueue it for the next request */
+            if (available)
+            {
+                fetch_received_data(leader);
+                add_async_waiter(leader);
+            }
+        }
+    }
+
     if (fsstate->next_tuple >= fsstate->num_tuples)
     {
-        /* No point in another fetch if we already detected EOF, though. */
-        if (!fsstate->eof_reached)
-            fetch_more_data(node);
-        /* If we didn't get any tuples, must be end of data. */
-        if (fsstate->next_tuple >= fsstate->num_tuples)
-            return ExecClearTuple(slot);
+        /*
+         * We haven't received a result for the given node this time, return
+         * with no tuple to give way to another node.
+         */
+        if (fsstate->eof_reached)
+            node->ss.ps.asyncstate = AS_AVAILABLE;
+        else
+            node->ss.ps.asyncstate = AS_WAITING;
+
+        return ExecClearTuple(slot);
     }
 
     /*
      * Return the next tuple.
      */
+    node->ss.ps.asyncstate = AS_AVAILABLE;
     ExecStoreHeapTuple(fsstate->tuples[fsstate->next_tuple++],
                        slot,
                        false);
@@ -1535,7 +1788,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     char        sql[64];
     PGresult   *res;
 
@@ -1543,6 +1796,8 @@ postgresReScanForeignScan(ForeignScanState *node)
     if (!fsstate->cursor_exists)
         return;
 
+    vacate_connection((PgFdwState *)fsstate, true);
+
     /*
      * If any internal parameters affecting this node have changed, we'd
      * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1571,9 +1826,9 @@ postgresReScanForeignScan(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_exec_query(fsstate->conn, sql);
+    res = pgfdw_exec_query(fsstate->s.conn, sql);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+        pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
     PQclear(res);
 
     /* Now force a fresh FETCH. */
@@ -1591,7 +1846,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
     /* if fsstate is NULL, we are in EXPLAIN; nothing to do */
     if (fsstate == NULL)
@@ -1599,15 +1854,31 @@ postgresEndForeignScan(ForeignScanState *node)
 
     /* Close the cursor if open, to prevent accumulation of cursors */
     if (fsstate->cursor_exists)
-        close_cursor(fsstate->conn, fsstate->cursor_number);
+        close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
     /* Release remote connection */
-    ReleaseConnection(fsstate->conn);
-    fsstate->conn = NULL;
+    ReleaseConnection(fsstate->s.conn);
+    fsstate->s.conn = NULL;
 
     /* MemoryContexts will be deleted automatically. */
 }
 
+/*
+ * postgresShutdownForeignScan
+ *        Remove asynchrony stuff and cleanup garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+    ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+    if (plan->operation != CMD_SELECT)
+        return;
+
+    /* remove the node from waiting queue */
+    remove_async_node(node);
+}
+
 /*
  * postgresAddForeignUpdateTargets
  *        Add resjunk column(s) needed for update/delete on a foreign table
@@ -2372,7 +2643,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    dmstate->conn = GetConnection(user, false);
+    dmstate->s.conn = GetConnection(user, false);
+    dmstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
 
     /* Update the foreign-join-related fields. */
     if (fsplan->scan.scanrelid == 0)
@@ -2457,7 +2730,11 @@ postgresIterateDirectModify(ForeignScanState *node)
      * If this is the first call after Begin, execute the statement.
      */
     if (dmstate->num_tuples == -1)
+    {
+        /* finish running query to send my command */
+        vacate_connection((PgFdwState *)dmstate, true);
         execute_dml_stmt(node);
+    }
 
     /*
      * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2504,8 +2781,8 @@ postgresEndDirectModify(ForeignScanState *node)
         PQclear(dmstate->result);
 
     /* Release remote connection */
-    ReleaseConnection(dmstate->conn);
-    dmstate->conn = NULL;
+    ReleaseConnection(dmstate->s.conn);
+    dmstate->s.conn = NULL;
 
     /* MemoryContext will be deleted automatically. */
 }
@@ -2703,6 +2980,7 @@ estimate_path_cost_size(PlannerInfo *root,
         List       *local_param_join_conds;
         StringInfoData sql;
         PGconn       *conn;
+        PgFdwConnCommonState *commonstate;
         Selectivity local_sel;
         QualCost    local_cost;
         List       *fdw_scan_tlist = NIL;
@@ -2747,6 +3025,18 @@ estimate_path_cost_size(PlannerInfo *root,
 
         /* Get the remote estimate */
         conn = GetConnection(fpinfo->user, false);
+        commonstate = GetConnectionSpecificStorage(fpinfo->user,
+                                                sizeof(PgFdwConnCommonState));
+        if (commonstate)
+        {
+            PgFdwState tmpstate;
+            tmpstate.conn = conn;
+            tmpstate.commonstate = commonstate;
+
+            /* finish running query to send my command */
+            vacate_connection(&tmpstate, true);
+        }
+
         get_remote_estimate(sql.data, conn, &rows, &width,
                             &startup_cost, &total_cost);
         ReleaseConnection(conn);
@@ -3317,11 +3607,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     ExprContext *econtext = node->ss.ps.ps_ExprContext;
     int            numParams = fsstate->numParams;
     const char **values = fsstate->param_values;
-    PGconn       *conn = fsstate->conn;
+    PGconn       *conn = fsstate->s.conn;
     StringInfoData buf;
     PGresult   *res;
 
@@ -3384,50 +3674,119 @@ create_cursor(ForeignScanState *node)
 }
 
 /*
- * Fetch some more rows from the node's cursor.
+ * Sends the next request of the node. If the given node is different from the
+ * current connection leader, pushes it back to waiter queue and let the given
+ * node be the leader.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.commonstate->leader;
+    PGconn       *conn = fsstate->s.conn;
+    char        sql[64];
+
+    /* must be non-busy */
+    Assert(!fsstate->s.commonstate->busy);
+    /* must be not-eof'ed */
+    Assert(!fsstate->eof_reached);
+
+    /*
+     * If this is the first call after Begin or ReScan, we need to create the
+     * cursor on the remote side.
+     */
+    if (!fsstate->cursor_exists)
+        create_cursor(node);
+
+    snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+             fsstate->fetch_size, fsstate->cursor_number);
+
+    if (!PQsendQuery(conn, sql))
+        pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+    fsstate->s.commonstate->busy = true;
+
+    /* The node is the current leader, just return. */
+    if (leader == node)
+        return;
+
+    /* Let the node be the leader */
+    if (leader != NULL)
+    {
+        remove_async_node(node);
+        fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+        fsstate->waiter = leader;
+    }
+    else
+    {
+        fsstate->last_waiter = node;
+        fsstate->waiter = NULL;
+    }
+
+    fsstate->s.commonstate->leader = node;
+}
+
+/*
+ * Fetches received data and automatically send requests of the next waiter.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     PGresult   *volatile res = NULL;
     MemoryContext oldcontext;
+    ForeignScanState *waiter;
+
+    /* I should be the current connection leader */
+    Assert(fsstate->s.commonstate->leader == node);
 
     /*
      * We'll store the tuples in the batch_cxt.  First, flush the previous
-     * batch.
+     * batch if no tuple is remaining
      */
-    fsstate->tuples = NULL;
-    MemoryContextReset(fsstate->batch_cxt);
+    if (fsstate->next_tuple >= fsstate->num_tuples)
+    {
+        fsstate->tuples = NULL;
+        fsstate->num_tuples = 0;
+        MemoryContextReset(fsstate->batch_cxt);
+    }
+    else if (fsstate->next_tuple > 0)
+    {
+        /* There's some remains. Move them to the beginning of the store */
+        int n = 0;
+
+        while(fsstate->next_tuple < fsstate->num_tuples)
+            fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+        fsstate->num_tuples = n;
+    }
+
     oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
     /* PGresult must be released before leaving this function. */
     PG_TRY();
     {
-        PGconn       *conn = fsstate->conn;
-        char        sql[64];
-        int            numrows;
+        PGconn       *conn = fsstate->s.conn;
+        int            addrows;
+        size_t        newsize;
         int            i;
 
-        snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
-                 fsstate->fetch_size, fsstate->cursor_number);
-
-        res = pgfdw_exec_query(conn, sql);
-        /* On error, report the original query, not the FETCH. */
+        res = pgfdw_get_result(conn, fsstate->query);
         if (PQresultStatus(res) != PGRES_TUPLES_OK)
             pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
         /* Convert the data into HeapTuples */
-        numrows = PQntuples(res);
-        fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-        fsstate->num_tuples = numrows;
-        fsstate->next_tuple = 0;
+        addrows = PQntuples(res);
+        newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+        if (fsstate->tuples)
+            fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+        else
+            fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-        for (i = 0; i < numrows; i++)
+        for (i = 0; i < addrows; i++)
         {
             Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-            fsstate->tuples[i] =
+            fsstate->tuples[fsstate->num_tuples + i] =
                 make_tuple_from_result_row(res, i,
                                            fsstate->rel,
                                            fsstate->attinmeta,
@@ -3437,22 +3796,73 @@ fetch_more_data(ForeignScanState *node)
         }
 
         /* Update fetch_ct_2 */
-        if (fsstate->fetch_ct_2 < 2)
+        if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
             fsstate->fetch_ct_2++;
 
+        fsstate->next_tuple = 0;
+        fsstate->num_tuples += addrows;
+
         /* Must be EOF if we didn't get as many tuples as we asked for. */
-        fsstate->eof_reached = (numrows < fsstate->fetch_size);
+        fsstate->eof_reached = (addrows < fsstate->fetch_size);
     }
     PG_FINALLY();
     {
+        fsstate->s.commonstate->busy = false;
+
         if (res)
             PQclear(res);
     }
     PG_END_TRY();
 
+    /* let the first waiter be the next leader of this connection */
+    waiter = move_to_next_waiter(node);
+
+    /* send the next request if any */
+    if (waiter)
+        request_more_data(waiter);
+
     MemoryContextSwitchTo(oldcontext);
 }
 
+/*
+ * Vacate the underlying connection so that this node can send the next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+    PgFdwConnCommonState *commonstate = fdwstate->commonstate;
+    ForeignScanState *leader;
+
+    Assert(commonstate != NULL);
+
+    /* just return if the connection is already available */
+    if (commonstate->leader == NULL || !commonstate->busy)
+        return;
+
+    /*
+     * let the current connection leader read all of the result for the running
+     * query
+     */
+    leader = commonstate->leader;
+    fetch_received_data(leader);
+
+    /* let the first waiter be the next leader of this connection */
+    move_to_next_waiter(leader);
+
+    if (!clear_queue)
+        return;
+
+    /* Clear the waiting list */
+    while (leader)
+    {
+        PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+        fsstate->last_waiter = NULL;
+        leader = fsstate->waiter;
+        fsstate->waiter = NULL;
+    }
+}
+
 /*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
@@ -3566,7 +3976,9 @@ create_foreign_modify(EState *estate,
     user = GetUserMapping(userid, table->serverid);
 
     /* Open connection; report that we'll create a prepared statement. */
-    fmstate->conn = GetConnection(user, true);
+    fmstate->s.conn = GetConnection(user, true);
+    fmstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
     fmstate->p_name = NULL;        /* prepared statement not made yet */
 
     /* Set up remote query information. */
@@ -3653,6 +4065,9 @@ execute_foreign_modify(EState *estate,
            operation == CMD_UPDATE ||
            operation == CMD_DELETE);
 
+    /* finish running query to send my command */
+    vacate_connection((PgFdwState *)fmstate, true);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -3680,14 +4095,14 @@ execute_foreign_modify(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3695,10 +4110,10 @@ execute_foreign_modify(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -3734,7 +4149,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
     /* Construct name we'll use for the prepared statement. */
     snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-             GetPrepStmtNumber(fmstate->conn));
+             GetPrepStmtNumber(fmstate->s.conn));
     p_name = pstrdup(prep_name);
 
     /*
@@ -3744,12 +4159,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * the prepared statements we use in this module are simple enough that
      * the remote server will make the right choices.
      */
-    if (!PQsendPrepare(fmstate->conn,
+    if (!PQsendPrepare(fmstate->s.conn,
                        p_name,
                        fmstate->query,
                        0,
                        NULL))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3757,9 +4172,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
     PQclear(res);
 
     /* This action shows that the prepare has been done. */
@@ -3888,16 +4303,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
          * We don't use a PG_TRY block here, so be careful not to throw error
          * without releasing the PGresult.
          */
-        res = pgfdw_exec_query(fmstate->conn, sql);
+        res = pgfdw_exec_query(fmstate->s.conn, sql);
         if (PQresultStatus(res) != PGRES_COMMAND_OK)
-            pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+            pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
         PQclear(res);
         fmstate->p_name = NULL;
     }
 
     /* Release remote connection */
-    ReleaseConnection(fmstate->conn);
-    fmstate->conn = NULL;
+    ReleaseConnection(fmstate->s.conn);
+    fmstate->s.conn = NULL;
 }
 
 /*
@@ -4056,9 +4471,9 @@ execute_dml_stmt(ForeignScanState *node)
      * the desired result.  This allows us to avoid assuming that the remote
      * server has the same OIDs we do for the parameters' types.
      */
-    if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+    if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
                            NULL, values, NULL, NULL, 0))
-        pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+        pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -4066,10 +4481,10 @@ execute_dml_stmt(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+    dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
     if (PQresultStatus(dmstate->result) !=
         (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+        pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
                            dmstate->query);
 
     /* Get the number of rows affected. */
@@ -5560,6 +5975,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
     /* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+    return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add wait event so that the ForeignScan node is going to wait for.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                                  void *caller_data, bool reinit)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+    /* Reinit is not supported for now. */
+    Assert(reinit);
+
+    if (fsstate->s.commonstate->leader == node)
+    {
+        AddWaitEventToSet(wes,
+                          WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+                          NULL, caller_data);
+        return true;
+    }
+
+    return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..96af75a33e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -85,6 +85,7 @@ typedef struct PgFdwRelationInfo
     UserMapping *user;            /* only set in use_remote_estimate mode */
 
     int            fetch_size;        /* fetch size for this remote table */
+    bool        allow_prefetch;    /* true to allow overlapped fetching  */
 
     /*
      * Name of the relation, for use while EXPLAINing ForeignScan.  It is used
@@ -130,6 +131,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 83971665e3..359208a12a 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1780,25 +1780,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1840,12 +1840,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1904,8 +1904,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.18.4


Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
Horiguchi-san,

On Thu, Jul 2, 2020 at 11:14 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> As the result of a discussion with Fujita-san off-list, I'm going to
> hold off development until he decides whether mine or Thomas' is
> better.

I'd like to join the party, but IIUC, we don't yet reach a consensus
on which one is the right way to go.  So I think we need to discuss
that first.

> However, I fixed two misbehaviors and rebased.

Thank you for the updated patch!

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Thomas Munro
Date:
On Thu, Jul 2, 2020 at 3:20 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Thu, Jul 2, 2020 at 11:14 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > As the result of a discussion with Fujita-san off-list, I'm going to
> > hold off development until he decides whether mine or Thomas' is
> > better.
>
> I'd like to join the party, but IIUC, we don't yet reach a consensus
> on which one is the right way to go.  So I think we need to discuss
> that first.

Either way, we definitely need patch 0001.  One comment:

-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)

I wonder if it's better to have it receive ResourceOwner like that, or
to have it capture CurrentResourceOwner.  I think the latter is more
common in existing code.



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, Aug 14, 2020 at 10:29 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Jul 2, 2020 at 3:20 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > I'd like to join the party, but IIUC, we don't yet reach a consensus
> > on which one is the right way to go.  So I think we need to discuss
> > that first.
>
> Either way, we definitely need patch 0001.  One comment:
>
> -CreateWaitEventSet(MemoryContext context, int nevents)
> +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
>
> I wonder if it's better to have it receive ResourceOwner like that, or
> to have it capture CurrentResourceOwner.  I think the latter is more
> common in existing code.

Sorry for not having discussed anything, but actually, I’ve started
reviewing your patch first.  I’ll return to this after reviewing it
some more.

Thanks!

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
At Wed, 19 Aug 2020 23:25:36 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in 
> On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote:
> > As the result of a discussion with Fujita-san off-list, I'm going to
> > hold off development until he decides whether mine or Thomas' is
> > better.
> 
> The latest patch doesn't apply so I set as WoA.
> https://commitfest.postgresql.org/29/2491/

Thanks. This is rebased version.

At Fri, 14 Aug 2020 13:29:16 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in 
> Either way, we definitely need patch 0001.  One comment:
> 
> -CreateWaitEventSet(MemoryContext context, int nevents)
> +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
> 
> I wonder if it's better to have it receive ResourceOwner like that, or
> to have it capture CurrentResourceOwner.  I think the latter is more
> common in existing code.

There's no existing WaitEventSets belonging to a resowner. So
unconditionally capturing CurrentResourceOwner doesn't work well. I
could pass a bool instead but that make things more complex.

Come to think of "complex", ExecAsync stuff in this patch might be
too-much for a short-term solution until executor overhaul, if it
comes shortly. (the patch of mine here as a whole is like that,
though..). The queueing stuff in postgres_fdw is, too.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 18176c9caa856c707ef6e8ab64bfc7f8abd9aea6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH v6 1/3] Allow wait event set to be registered to resource
 owner

WaitEventSet needs to be released using resource owner for a certain
case. This change adds WaitEventSet reowner and allow the creator of a
WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c            |  2 +-
 src/backend/postmaster/pgstat.c       |  2 +-
 src/backend/postmaster/syslogger.c    |  2 +-
 src/backend/storage/ipc/latch.c       | 20 ++++++--
 src/backend/utils/resowner/resowner.c | 67 +++++++++++++++++++++++++++
 src/include/storage/latch.h           |  4 +-
 src/include/utils/resowner_private.h  |  8 ++++
 7 files changed, 98 insertions(+), 7 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index ac986c0505..799fa5006d 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -218,7 +218,7 @@ pq_init(void)
                 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
     AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
                       NULL, NULL);
     AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 73ce944fb1..9d6b3778b4 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4488,7 +4488,7 @@ PgstatCollectorMain(int argc, char *argv[])
     pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
 
     /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
+    wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
     AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
     AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
     AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index ffcb54968f..a4de6d90e2 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -300,7 +300,7 @@ SysLoggerMain(int argc, char *argv[])
      * syslog pipe, which implies that all other backends have exited
      * (including the postmaster).
      */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 2);
+    wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 2);
     AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
 #ifndef WIN32
     AddWaitEventToSet(wes, WL_SOCKET_READABLE, syslogPipe[0], NULL, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 4153cc8557..e771ac9610 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -57,6 +57,7 @@
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
 #include "utils/memutils.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -85,6 +86,8 @@ struct WaitEventSet
     int            nevents;        /* number of registered events */
     int            nevents_space;    /* maximum number of events in this set */
 
+    ResourceOwner    resowner;    /* Resource owner */
+
     /*
      * Array, of nevents_space length, storing the definition of events this
      * set is waiting for.
@@ -257,7 +260,7 @@ InitializeLatchWaitSet(void)
     Assert(LatchWaitSet == NULL);
 
     /* Set up the WaitEventSet used by WaitLatch(). */
-    LatchWaitSet = CreateWaitEventSet(TopMemoryContext, 2);
+    LatchWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 2);
     latch_pos = AddWaitEventToSet(LatchWaitSet, WL_LATCH_SET, PGINVALID_SOCKET,
                                   MyLatch, NULL);
     if (IsUnderPostmaster)
@@ -441,7 +444,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
     int            ret = 0;
     int            rc;
     WaitEvent    event;
-    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
     if (wakeEvents & WL_TIMEOUT)
         Assert(timeout >= 0);
@@ -608,12 +611,15 @@ ResetLatch(Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
     WaitEventSet *set;
     char       *data;
     Size        sz = 0;
 
+    if (res)
+        ResourceOwnerEnlargeWESs(res);
+
     /*
      * Use MAXALIGN size/alignment to guarantee that later uses of memory are
      * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -728,6 +734,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
     StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+    /* Register this wait event set if requested */
+    set->resowner = res;
+    if (res)
+        ResourceOwnerRememberWES(set->resowner, set);
+
     return set;
 }
 
@@ -773,6 +784,9 @@ FreeWaitEventSet(WaitEventSet *set)
     }
 #endif
 
+    if (set->resowner != NULL)
+        ResourceOwnerForgetWES(set->resowner, set);
+
     pfree(set);
 }
 
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 8bc2c4e9ea..237ca9fa30 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -128,6 +128,7 @@ typedef struct ResourceOwnerData
     ResourceArray filearr;        /* open temporary files */
     ResourceArray dsmarr;        /* dynamic shmem segments */
     ResourceArray jitarr;        /* JIT contexts */
+    ResourceArray wesarr;        /* wait event sets */
 
     /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
     int            nlocks;            /* number of owned locks */
@@ -175,6 +176,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -444,6 +446,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
     ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
     ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
     ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+    ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
     return owner;
 }
@@ -553,6 +556,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 
             jit_release_context(context);
         }
+
+        /* Ditto for wait event sets */
+        while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+        {
+            WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+            if (isCommit)
+                PrintWESLeakWarning(event);
+            FreeWaitEventSet(event);
+        }
     }
     else if (phase == RESOURCE_RELEASE_LOCKS)
     {
@@ -725,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     Assert(owner->filearr.nitems == 0);
     Assert(owner->dsmarr.nitems == 0);
     Assert(owner->jitarr.nitems == 0);
+    Assert(owner->wesarr.nitems == 0);
     Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
     /*
@@ -752,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     ResourceArrayFree(&(owner->filearr));
     ResourceArrayFree(&(owner->dsmarr));
     ResourceArrayFree(&(owner->jitarr));
+    ResourceArrayFree(&(owner->wesarr));
 
     pfree(owner);
 }
@@ -1370,3 +1385,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
         elog(ERROR, "JIT context %p is not owned by resource owner %s",
              DatumGetPointer(handle), owner->name);
 }
+
+/*
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+    ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+    ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identier of a wait event set,
+     * use its pointer instead.
+     */
+    if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+        elog(ERROR, "wait event set %p is not owned by resource owner %s",
+             events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identier of a wait event set,
+     * use its pointer instead.
+     */
+    elog(WARNING, "wait event set leak: %p still referenced",
+         events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 7c742021fb..ae13d4c08d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -163,7 +164,8 @@ extern void DisownLatch(Latch *latch);
 extern void SetLatch(Latch *latch);
 extern void ResetLatch(Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+                                        ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int    AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
                               Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..7d19dadd57 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
 extern void ResourceOwnerForgetJIT(ResourceOwner owner,
                                    Datum handle);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+                         WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+                       WaitEventSet *);
+
 #endif                            /* RESOWNER_PRIVATE_H */
-- 
2.18.4

From 87edb960381302fb487e5200d3cd228eb8a9b413 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 15 May 2018 20:21:32 +0900
Subject: [PATCH v6 2/3] Infrastructure for asynchronous execution

This patch add an infrastructure for asynchronous execution. As a PoC
this makes only Append capable to handle asynchronously executable
subnodes.
---
 src/backend/commands/explain.c          |  17 ++
 src/backend/executor/Makefile           |   1 +
 src/backend/executor/execAsync.c        | 152 +++++++++++
 src/backend/executor/nodeAppend.c       | 342 ++++++++++++++++++++----
 src/backend/executor/nodeForeignscan.c  |  21 ++
 src/backend/nodes/bitmapset.c           |  72 +++++
 src/backend/nodes/copyfuncs.c           |   3 +
 src/backend/nodes/outfuncs.c            |   3 +
 src/backend/nodes/readfuncs.c           |   3 +
 src/backend/optimizer/path/allpaths.c   |  24 ++
 src/backend/optimizer/path/costsize.c   |  55 +++-
 src/backend/optimizer/plan/createplan.c |  45 +++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/utils/adt/ruleutils.c       |   8 +-
 src/backend/utils/resowner/resowner.c   |   4 +-
 src/include/executor/execAsync.h        |  22 ++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 +
 src/include/nodes/bitmapset.h           |   1 +
 src/include/nodes/execnodes.h           |  23 +-
 src/include/nodes/plannodes.h           |   9 +
 src/include/optimizer/paths.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 24 files changed, 756 insertions(+), 72 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 30e0a7ee7f..07001da4a3 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -86,6 +86,7 @@ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
                                        List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                                    ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
                           ExplainState *es);
 static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1389,6 +1390,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
         }
         if (plan->parallel_aware)
             appendStringInfoString(es->str, "Parallel ");
+        if (plan->async_capable)
+            appendStringInfoString(es->str, "Async ");
         appendStringInfoString(es->str, pname);
         es->indent++;
     }
@@ -1970,6 +1973,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
         case T_Hash:
             show_hash_info(castNode(HashState, planstate), es);
             break;
+
+        case T_Append:
+            show_append_info(castNode(AppendState, planstate), es);
+            break;
+
         default:
             break;
     }
@@ -2323,6 +2331,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                          ancestors, es);
 }
 
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+    Append *plan = (Append *) astate->ps.plan;
+
+    if (plan->nasyncplans > 0)
+        ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
 /*
  * Show the grouping keys for an Agg node.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
     execAmi.o \
+    execAsync.o \
     execCurrent.o \
     execExpr.o \
     execExprInterp.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..2b7d1877e0
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,152 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *      Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *      src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+/*
+ * ExecAsyncConfigureWait: Add wait event to the WaitEventSet if needed.
+ *
+ * If reinit is true, the caller didn't reuse existing WaitEventSet.
+ */
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                       void *data, bool reinit)
+{
+    switch (nodeTag(node))
+    {
+    case T_ForeignScanState:
+        return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+                                             wes, data, reinit);
+        break;
+    default:
+            elog(ERROR, "unrecognized node type: %d",
+                (int) nodeTag(node));
+    }
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+    int **p_refind;
+    int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+    /* arg is the address of the variable refind in ExecAsyncEventWait */
+    ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+    *mcbarg->p_refind = NULL;
+    *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * ExecAsyncEventWait:
+ *
+ * Wait for async events to fire. Returns the Bitmapset of fired events.
+ */
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+    static int *refind = NULL;
+    static int refindsize = 0;
+    WaitEventSet *wes;
+    WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+    int noccurred = 0;
+    Bitmapset *fired_events = NULL;
+    int i;
+    int n;
+
+    n = bms_num_members(waitnodes);
+    wes = CreateWaitEventSet(TopTransactionContext,
+                             TopTransactionResourceOwner, n);
+    if (refindsize < n)
+    {
+        if (refindsize == 0)
+            refindsize = EVENT_BUFFER_SIZE; /* XXX */
+        while (refindsize < n)
+            refindsize *= 2;
+        if (refind)
+            refind = (int *) repalloc(refind, refindsize * sizeof(int));
+        else
+        {
+            static ExecAsync_mcbarg mcb_arg =
+                { &refind, &refindsize };
+            static MemoryContextCallback mcb =
+                { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+            MemoryContext oldctxt =
+                MemoryContextSwitchTo(TopTransactionContext);
+
+            /*
+             * refind points to a memory block in
+             * TopTransactionContext. Register a callback to reset it.
+             */
+            MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+            refind = (int *) palloc(refindsize * sizeof(int));
+            MemoryContextSwitchTo(oldctxt);
+        }
+    }
+
+    /* Prepare WaitEventSet for waiting on the waitnodes. */
+    n = 0;
+    for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+         i = bms_next_member(waitnodes, i))
+    {
+        refind[i] = i;
+        if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+            n++;
+    }
+
+    /* Return immediately if no node to wait. */
+    if (n == 0)
+    {
+        FreeWaitEventSet(wes);
+        return NULL;
+    }
+
+    noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+                                 EVENT_BUFFER_SIZE,
+                                 WAIT_EVENT_ASYNC_WAIT);
+    FreeWaitEventSet(wes);
+    if (noccurred == 0)
+        return NULL;
+
+    for (i = 0 ; i < noccurred ; i++)
+    {
+        WaitEvent *w = &occurred_event[i];
+
+        if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+        {
+            int n = *(int*)w->user_data;
+
+            fired_events = bms_add_member(fired_events, n);
+        }
+    }
+
+    return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..60c36ee048 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
 #include "executor/execdebug.h"
 #include "executor/execPartition.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -80,6 +81,7 @@ struct ParallelAppendState
 #define INVALID_SUBPLAN_INDEX        -1
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -103,22 +105,22 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     PlanState **appendplanstates;
     Bitmapset  *validsubplans;
     int            nplans;
+    int            nasyncplans;
     int            firstvalid;
     int            i,
                 j;
 
     /* check for unsupported flags */
-    Assert(!(eflags & EXEC_FLAG_MARK));
+    Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
     /*
      * create new AppendState for our append node
      */
     appendstate->ps.plan = (Plan *) node;
     appendstate->ps.state = estate;
-    appendstate->ps.ExecProcNode = ExecAppend;
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+    appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 
     /* If run-time partition pruning is enabled, then set that up now */
     if (node->part_prune_info != NULL)
@@ -152,11 +154,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 
         /*
          * When no run-time pruning is required and there's at least one
-         * subplan, we can fill as_valid_subplans immediately, preventing
+         * subplan, we can fill as_valid_syncsubplans immediately, preventing
          * later calls to ExecFindMatchingSubPlans.
          */
         if (!prunestate->do_exec_prune && nplans > 0)
-            appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
+            appendstate->as_valid_syncsubplans =
+                bms_add_range(NULL, node->nasyncplans, nplans - 1);
     }
     else
     {
@@ -167,8 +170,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
          * subplans as valid; they must also all be initialized.
          */
         Assert(nplans > 0);
-        appendstate->as_valid_subplans = validsubplans =
-            bms_add_range(NULL, 0, nplans - 1);
+        validsubplans =    bms_add_range(NULL, 0, nplans - 1);
+        appendstate->as_valid_syncsubplans =
+            bms_add_range(NULL, node->nasyncplans, nplans - 1);
         appendstate->as_prune_state = NULL;
     }
 
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
      */
     j = 0;
     firstvalid = nplans;
+    nasyncplans = 0;
+
     i = -1;
     while ((i = bms_next_member(validsubplans, i)) >= 0)
     {
         Plan       *initNode = (Plan *) list_nth(node->appendplans, i);
+        int            sub_eflags = eflags;
+
+        /* Let async-capable subplans run asynchronously */
+        if (i < node->nasyncplans)
+        {
+            sub_eflags |= EXEC_FLAG_ASYNC;
+            nasyncplans++;
+        }
 
         /*
          * Record the lowest appendplans index which is a valid partial plan.
@@ -203,13 +217,46 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
         if (i >= node->first_partial_plan && j < firstvalid)
             firstvalid = j;
 
-        appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+        appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
     }
 
     appendstate->as_first_partial_plan = firstvalid;
     appendstate->appendplans = appendplanstates;
     appendstate->as_nplans = nplans;
 
+    /* fill in async stuff */
+    appendstate->as_nasyncplans = nasyncplans;
+    appendstate->as_syncdone = (nasyncplans == nplans);
+    appendstate->as_exec_prune = false;
+
+    /* choose appropriate version of Exec function */
+    if (appendstate->as_nasyncplans == 0)
+        appendstate->ps.ExecProcNode = ExecAppend;
+    else
+        appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+    if (appendstate->as_nasyncplans)
+    {
+        appendstate->as_asyncresult = (TupleTableSlot **)
+            palloc0(appendstate->as_nasyncplans * sizeof(TupleTableSlot *));
+
+        /* initially, all async requests need a request */
+        appendstate->as_needrequest =
+            bms_add_range(NULL, 0, appendstate->as_nasyncplans - 1);
+
+        /*
+         * ExecAppendAsync needs as_valid_syncsubplans to handle async
+         * subnodes.
+         */
+        if (appendstate->as_prune_state != NULL &&
+            appendstate->as_prune_state->do_exec_prune)
+        {
+            Assert(appendstate->as_valid_syncsubplans == NULL);
+
+            appendstate->as_exec_prune = true;
+        }
+    }
+
     /*
      * Miscellaneous initialization
      */
@@ -233,7 +280,7 @@ ExecAppend(PlanState *pstate)
 {
     AppendState *node = castNode(AppendState, pstate);
 
-    if (node->as_whichplan < 0)
+    if (node->as_whichsyncplan < 0)
     {
         /* Nothing to do if there are no subplans */
         if (node->as_nplans == 0)
@@ -243,11 +290,13 @@ ExecAppend(PlanState *pstate)
          * If no subplan has been chosen, we must choose one before
          * proceeding.
          */
-        if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+        if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
             !node->choose_next_subplan(node))
             return ExecClearTuple(node->ps.ps_ResultTupleSlot);
     }
 
+    Assert(node->as_nasyncplans == 0);
+
     for (;;)
     {
         PlanState  *subnode;
@@ -258,8 +307,9 @@ ExecAppend(PlanState *pstate)
         /*
          * figure out which subplan we are currently processing
          */
-        Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-        subnode = node->appendplans[node->as_whichplan];
+        Assert(node->as_whichsyncplan >= 0 &&
+               node->as_whichsyncplan < node->as_nplans);
+        subnode = node->appendplans[node->as_whichsyncplan];
 
         /*
          * get a tuple from the subplan
@@ -282,6 +332,172 @@ ExecAppend(PlanState *pstate)
     }
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+    AppendState *node = castNode(AppendState, pstate);
+    Bitmapset *needrequest;
+    int    i;
+
+    Assert(node->as_nasyncplans > 0);
+
+restart:
+    if (node->as_nasyncresult > 0)
+    {
+        --node->as_nasyncresult;
+        return node->as_asyncresult[node->as_nasyncresult];
+    }
+
+    if (node->as_exec_prune)
+    {
+        Bitmapset *valid_subplans =
+            ExecFindMatchingSubPlans(node->as_prune_state);
+
+        /* Distribute valid subplans into sync and async */
+        node->as_needrequest =
+            bms_intersect(node->as_needrequest, valid_subplans);
+        node->as_valid_syncsubplans =
+            bms_difference(valid_subplans, node->as_needrequest);
+
+        node->as_exec_prune = false;
+    }
+
+    needrequest = node->as_needrequest;
+    node->as_needrequest = NULL;
+    while ((i = bms_first_member(needrequest)) >= 0)
+    {
+        TupleTableSlot *slot;
+        PlanState *subnode = node->appendplans[i];
+
+        slot = ExecProcNode(subnode);
+        if (subnode->asyncstate == AS_AVAILABLE)
+        {
+            if (!TupIsNull(slot))
+            {
+                node->as_asyncresult[node->as_nasyncresult++] = slot;
+                node->as_needrequest = bms_add_member(node->as_needrequest, i);
+            }
+        }
+        else
+            node->as_pending_async = bms_add_member(node->as_pending_async, i);
+    }
+    bms_free(needrequest);
+
+    for (;;)
+    {
+        TupleTableSlot *result;
+
+        /* return now if a result is available */
+        if (node->as_nasyncresult > 0)
+        {
+            --node->as_nasyncresult;
+            return node->as_asyncresult[node->as_nasyncresult];
+        }
+
+        while (!bms_is_empty(node->as_pending_async))
+        {
+            /* Don't wait for async nodes if any sync node exists. */
+            long timeout = node->as_syncdone ? -1 : 0;
+            Bitmapset *fired;
+            int i;
+
+            fired = ExecAsyncEventWait(node->appendplans,
+                                       node->as_pending_async,
+                                       timeout);
+
+            if (bms_is_empty(fired) && node->as_syncdone)
+            {
+                /*
+                 * We come here when all the subnodes had fired before
+                 * waiting. Retry fetching from the nodes.
+                 */
+                node->as_needrequest = node->as_pending_async;
+                node->as_pending_async = NULL;
+                goto restart;
+            }
+
+            while ((i = bms_first_member(fired)) >= 0)
+            {
+                TupleTableSlot *slot;
+                PlanState *subnode = node->appendplans[i];
+                slot = ExecProcNode(subnode);
+
+                Assert(subnode->asyncstate == AS_AVAILABLE);
+
+                if (!TupIsNull(slot))
+                {
+                    node->as_asyncresult[node->as_nasyncresult++] = slot;
+                    node->as_needrequest =
+                        bms_add_member(node->as_needrequest, i);
+                }
+
+                node->as_pending_async =
+                    bms_del_member(node->as_pending_async, i);
+            }
+            bms_free(fired);
+
+            /* return now if a result is available */
+            if (node->as_nasyncresult > 0)
+            {
+                --node->as_nasyncresult;
+                return node->as_asyncresult[node->as_nasyncresult];
+            }
+
+            if (!node->as_syncdone)
+                break;
+        }
+
+        /*
+         * If there is no asynchronous activity still pending and the
+         * synchronous activity is also complete, we're totally done scanning
+         * this node.  Otherwise, we're done with the asynchronous stuff but
+         * must continue scanning the synchronous children.
+         */
+
+        if (!node->as_syncdone &&
+            node->as_whichsyncplan == INVALID_SUBPLAN_INDEX)
+            node->as_syncdone = !node->choose_next_subplan(node);
+
+        if (node->as_syncdone)
+        {
+            Assert(bms_is_empty(node->as_pending_async));
+            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+        }
+
+        /*
+         * get a tuple from the subplan
+         */
+        result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+        if (!TupIsNull(result))
+        {
+            /*
+             * If the subplan gave us something then return it as-is. We do
+             * NOT make use of the result slot that was set up in
+             * ExecInitAppend; there's no need for it.
+             */
+            return result;
+        }
+
+        /*
+         * Go on to the "next" subplan. If no more subplans, return the empty
+         * slot set up for us by ExecInitAppend, unless there are async plans
+         * we have yet to finish.
+         */
+        if (!node->choose_next_subplan(node))
+        {
+            node->as_syncdone = true;
+            if (bms_is_empty(node->as_pending_async))
+            {
+                Assert(bms_is_empty(node->as_needrequest));
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+            }
+        }
+
+        /* Else loop back and try to get a tuple from the new subplan */
+    }
+}
+
 /* ----------------------------------------------------------------
  *        ExecEndAppend
  *
@@ -324,10 +540,18 @@ ExecReScanAppend(AppendState *node)
         bms_overlap(node->ps.chgParam,
                     node->as_prune_state->execparamids))
     {
-        bms_free(node->as_valid_subplans);
-        node->as_valid_subplans = NULL;
+        bms_free(node->as_valid_syncsubplans);
+        node->as_valid_syncsubplans = NULL;
     }
 
+    /* Reset async state. */
+    for (i = 0; i < node->as_nasyncplans; ++i)
+        ExecShutdownNode(node->appendplans[i]);
+
+    node->as_nasyncresult = 0;
+    node->as_needrequest = bms_add_range(NULL, 0, node->as_nasyncplans - 1);
+    node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
     for (i = 0; i < node->as_nplans; i++)
     {
         PlanState  *subnode = node->appendplans[i];
@@ -348,7 +572,7 @@ ExecReScanAppend(AppendState *node)
     }
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    node->as_whichplan = INVALID_SUBPLAN_INDEX;
+    node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 }
 
 /* ----------------------------------------------------------------
@@ -436,7 +660,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-    int            whichplan = node->as_whichplan;
+    int            whichplan = node->as_whichsyncplan;
     int            nextplan;
 
     /* We should never be called when there are no subplans */
@@ -451,10 +675,18 @@ choose_next_subplan_locally(AppendState *node)
      */
     if (whichplan == INVALID_SUBPLAN_INDEX)
     {
-        if (node->as_valid_subplans == NULL)
-            node->as_valid_subplans =
+        /* Shouldn't have an active async node */
+        Assert(bms_is_empty(node->as_needrequest));
+
+        if (node->as_valid_syncsubplans == NULL)
+            node->as_valid_syncsubplans =
                 ExecFindMatchingSubPlans(node->as_prune_state);
 
+        /* Exclude async plans */
+        if (node->as_nasyncplans > 0)
+            bms_del_range(node->as_valid_syncsubplans,
+                          0, node->as_nasyncplans - 1);
+
         whichplan = -1;
     }
 
@@ -462,14 +694,14 @@ choose_next_subplan_locally(AppendState *node)
     Assert(whichplan >= -1 && whichplan <= node->as_nplans);
 
     if (ScanDirectionIsForward(node->ps.state->es_direction))
-        nextplan = bms_next_member(node->as_valid_subplans, whichplan);
+        nextplan = bms_next_member(node->as_valid_syncsubplans, whichplan);
     else
-        nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
+        nextplan = bms_prev_member(node->as_valid_syncsubplans, whichplan);
 
     if (nextplan < 0)
         return false;
 
-    node->as_whichplan = nextplan;
+    node->as_whichsyncplan = nextplan;
 
     return true;
 }
@@ -490,29 +722,29 @@ choose_next_subplan_for_leader(AppendState *node)
     /* Backward scan is not supported by parallel-aware plans */
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
-    /* We should never be called when there are no subplans */
-    Assert(node->as_nplans > 0);
+    /* We should never be called when there are no sync subplans */
+    Assert(node->as_nplans > node->as_nasyncplans);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
     {
         /* Mark just-completed subplan as finished. */
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
     }
     else
     {
         /* Start with last subplan. */
-        node->as_whichplan = node->as_nplans - 1;
+        node->as_whichsyncplan = node->as_nplans - 1;
 
         /*
          * If we've yet to determine the valid subplans then do so now.  If
          * run-time pruning is disabled then the valid subplans will always be
          * set to all subplans.
          */
-        if (node->as_valid_subplans == NULL)
+        if (node->as_valid_syncsubplans == NULL)
         {
-            node->as_valid_subplans =
+            node->as_valid_syncsubplans =
                 ExecFindMatchingSubPlans(node->as_prune_state);
 
             /*
@@ -524,26 +756,26 @@ choose_next_subplan_for_leader(AppendState *node)
     }
 
     /* Loop until we find a subplan to execute. */
-    while (pstate->pa_finished[node->as_whichplan])
+    while (pstate->pa_finished[node->as_whichsyncplan])
     {
-        if (node->as_whichplan == 0)
+        if (node->as_whichsyncplan == 0)
         {
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-            node->as_whichplan = INVALID_SUBPLAN_INDEX;
+            node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
             LWLockRelease(&pstate->pa_lock);
             return false;
         }
 
         /*
-         * We needn't pay attention to as_valid_subplans here as all invalid
+         * We needn't pay attention to as_valid_syncsubplans here as all invalid
          * plans have been marked as finished.
          */
-        node->as_whichplan--;
+        node->as_whichsyncplan--;
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -571,23 +803,23 @@ choose_next_subplan_for_worker(AppendState *node)
     /* Backward scan is not supported by parallel-aware plans */
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
-    /* We should never be called when there are no subplans */
-    Assert(node->as_nplans > 0);
+    /* We should never be called when there are no sync subplans */
+    Assert(node->as_nplans > node->as_nasyncplans);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
     /* Mark just-completed subplan as finished. */
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     /*
      * If we've yet to determine the valid subplans then do so now.  If
      * run-time pruning is disabled then the valid subplans will always be set
      * to all subplans.
      */
-    else if (node->as_valid_subplans == NULL)
+    else if (node->as_valid_syncsubplans == NULL)
     {
-        node->as_valid_subplans =
+        node->as_valid_syncsubplans =
             ExecFindMatchingSubPlans(node->as_prune_state);
         mark_invalid_subplans_as_finished(node);
     }
@@ -600,30 +832,30 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Save the plan from which we are starting the search. */
-    node->as_whichplan = pstate->pa_next_plan;
+    node->as_whichsyncplan = pstate->pa_next_plan;
 
     /* Loop until we find a valid subplan to execute. */
     while (pstate->pa_finished[pstate->pa_next_plan])
     {
         int            nextplan;
 
-        nextplan = bms_next_member(node->as_valid_subplans,
+        nextplan = bms_next_member(node->as_valid_syncsubplans,
                                    pstate->pa_next_plan);
         if (nextplan >= 0)
         {
             /* Advance to the next valid plan. */
             pstate->pa_next_plan = nextplan;
         }
-        else if (node->as_whichplan > node->as_first_partial_plan)
+        else if (node->as_whichsyncplan > node->as_first_partial_plan)
         {
             /*
              * Try looping back to the first valid partial plan, if there is
              * one.  If there isn't, arrange to bail out below.
              */
-            nextplan = bms_next_member(node->as_valid_subplans,
+            nextplan = bms_next_member(node->as_valid_syncsubplans,
                                        node->as_first_partial_plan - 1);
             pstate->pa_next_plan =
-                nextplan < 0 ? node->as_whichplan : nextplan;
+                nextplan < 0 ? node->as_whichsyncplan : nextplan;
         }
         else
         {
@@ -631,10 +863,10 @@ choose_next_subplan_for_worker(AppendState *node)
              * At last plan, and either there are no partial plans or we've
              * tried them all.  Arrange to bail out.
              */
-            pstate->pa_next_plan = node->as_whichplan;
+            pstate->pa_next_plan = node->as_whichsyncplan;
         }
 
-        if (pstate->pa_next_plan == node->as_whichplan)
+        if (pstate->pa_next_plan == node->as_whichsyncplan)
         {
             /* We've tried everything! */
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -644,8 +876,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Pick the plan we found, and advance pa_next_plan one more time. */
-    node->as_whichplan = pstate->pa_next_plan;
-    pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
+    node->as_whichsyncplan = pstate->pa_next_plan;
+    pstate->pa_next_plan = bms_next_member(node->as_valid_syncsubplans,
                                            pstate->pa_next_plan);
 
     /*
@@ -654,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node)
      */
     if (pstate->pa_next_plan < 0)
     {
-        int            nextplan = bms_next_member(node->as_valid_subplans,
+        int            nextplan = bms_next_member(node->as_valid_syncsubplans,
                                                node->as_first_partial_plan - 1);
 
         if (nextplan >= 0)
@@ -671,8 +903,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -699,13 +931,13 @@ mark_invalid_subplans_as_finished(AppendState *node)
     Assert(node->as_prune_state);
 
     /* Nothing to do if all plans are valid */
-    if (bms_num_members(node->as_valid_subplans) == node->as_nplans)
+    if (bms_num_members(node->as_valid_syncsubplans) == node->as_nplans)
         return;
 
     /* Mark all non-valid plans as finished */
     for (i = 0; i < node->as_nplans; i++)
     {
-        if (!bms_is_member(i, node->as_valid_subplans))
+        if (!bms_is_member(i, node->as_valid_syncsubplans))
             node->as_pstate->pa_finished[i] = true;
     }
 }
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 513471ab9b..3bf4aaa63d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -141,6 +141,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
     scanstate->ss.ps.plan = (Plan *) node;
     scanstate->ss.ps.state = estate;
     scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+    scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+    if ((eflags & EXEC_FLAG_ASYNC) != 0)
+        scanstate->fs_async = true;
 
     /*
      * Miscellaneous initialization
@@ -384,3 +388,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
     if (fdwroutine->ShutdownForeignScan)
         fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *        ExecAsyncForeignScanConfigureWait
+ *
+ *        In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                              void *caller_data, bool reinit)
+{
+    FdwRoutine *fdwroutine = node->fdwroutine;
+
+    Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+    return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+                                                 caller_data, reinit);
+}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 2719ea45a3..05b625783b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -895,6 +895,78 @@ bms_add_range(Bitmapset *a, int lower, int upper)
     return a;
 }
 
+/*
+ * bms_del_range
+ *        Delete members in the range of 'lower' to 'upper' from the set.
+ *
+ * Note this could also be done by calling bms_del_member in a loop, however,
+ * using this function will be faster when the range is large as we work at
+ * the bitmapword level rather than at bit level.
+ */
+Bitmapset *
+bms_del_range(Bitmapset *a, int lower, int upper)
+{
+    int            lwordnum,
+                lbitnum,
+                uwordnum,
+                ushiftbits,
+                wordnum;
+
+    if (lower < 0 || upper < 0)
+        elog(ERROR, "negative bitmapset member not allowed");
+    if (lower > upper)
+        elog(ERROR, "lower range must not be above upper range");
+    uwordnum = WORDNUM(upper);
+
+    if (a == NULL)
+    {
+        a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1));
+        a->nwords = uwordnum + 1;
+    }
+
+    /* ensure we have enough words to store the upper bit */
+    else if (uwordnum >= a->nwords)
+    {
+        int            oldnwords = a->nwords;
+        int            i;
+
+        a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1));
+        a->nwords = uwordnum + 1;
+        /* zero out the enlarged portion */
+        for (i = oldnwords; i < a->nwords; i++)
+            a->words[i] = 0;
+    }
+
+    wordnum = lwordnum = WORDNUM(lower);
+
+    lbitnum = BITNUM(lower);
+    ushiftbits = BITNUM(upper) + 1;
+
+    /*
+     * Special case when lwordnum is the same as uwordnum we must perform the
+     * upper and lower masking on the word.
+     */
+    if (lwordnum == uwordnum)
+    {
+        a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1)
+                                | (~(bitmapword) 0) << ushiftbits);
+    }
+    else
+    {
+        /* turn off lbitnum and all bits left of it */
+        a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1);
+
+        /* turn off all bits for any intermediate words */
+        while (wordnum < uwordnum)
+            a->words[wordnum++] = (bitmapword) 0;
+
+        /* turn off upper's bit and all bits right of it. */
+        a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits;
+    }
+
+    return a;
+}
+
 /*
  * bms_int_members - like bms_intersect, but left input is recycled
  */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 89c409de66..db0234b17a 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -121,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
     COPY_SCALAR_FIELD(plan_width);
     COPY_SCALAR_FIELD(parallel_aware);
     COPY_SCALAR_FIELD(parallel_safe);
+    COPY_SCALAR_FIELD(async_capable);
     COPY_SCALAR_FIELD(plan_node_id);
     COPY_NODE_FIELD(targetlist);
     COPY_NODE_FIELD(qual);
@@ -246,6 +247,8 @@ _copyAppend(const Append *from)
     COPY_NODE_FIELD(appendplans);
     COPY_SCALAR_FIELD(first_partial_plan);
     COPY_NODE_FIELD(part_prune_info);
+    COPY_SCALAR_FIELD(nasyncplans);
+    COPY_SCALAR_FIELD(referent);
 
     return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515d..d4bb44b268 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -334,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
     WRITE_INT_FIELD(plan_width);
     WRITE_BOOL_FIELD(parallel_aware);
     WRITE_BOOL_FIELD(parallel_safe);
+    WRITE_BOOL_FIELD(async_capable);
     WRITE_INT_FIELD(plan_node_id);
     WRITE_NODE_FIELD(targetlist);
     WRITE_NODE_FIELD(qual);
@@ -436,6 +437,8 @@ _outAppend(StringInfo str, const Append *node)
     WRITE_NODE_FIELD(appendplans);
     WRITE_INT_FIELD(first_partial_plan);
     WRITE_NODE_FIELD(part_prune_info);
+    WRITE_INT_FIELD(nasyncplans);
+    WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab719..63af7c02d8 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1572,6 +1572,7 @@ ReadCommonPlan(Plan *local_node)
     READ_INT_FIELD(plan_width);
     READ_BOOL_FIELD(parallel_aware);
     READ_BOOL_FIELD(parallel_safe);
+    READ_BOOL_FIELD(async_capable);
     READ_INT_FIELD(plan_node_id);
     READ_NODE_FIELD(targetlist);
     READ_NODE_FIELD(qual);
@@ -1672,6 +1673,8 @@ _readAppend(void)
     READ_NODE_FIELD(appendplans);
     READ_INT_FIELD(first_partial_plan);
     READ_NODE_FIELD(part_prune_info);
+    READ_INT_FIELD(nasyncplans);
+    READ_INT_FIELD(referent);
 
     READ_DONE();
 }
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 6da0dcd61c..055c8a9fb0 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3954,6 +3954,30 @@ generate_partitionwise_join_paths(PlannerInfo *root, RelOptInfo *rel)
     list_free(live_children);
 }
 
+/*
+ * is_projection_capable_path
+ *        Check whether a given Path node is async-capable.
+ */
+bool
+is_async_capable_path(Path *path)
+{
+    switch (nodeTag(path))
+    {
+        case T_ForeignPath:
+            {
+                FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+                Assert(fdwroutine != NULL);
+                if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+                    fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+                    return true;
+            }
+        default:
+            break;
+    }
+    return false;
+}
+
 
 /*****************************************************************************
  *            DEBUG SUPPORT
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index fda4b2c6e8..da59c48091 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2048,22 +2048,59 @@ cost_append(AppendPath *apath)
 
         if (pathkeys == NIL)
         {
-            Path       *subpath = (Path *) linitial(apath->subpaths);
-
-            /*
-             * For an unordered, non-parallel-aware Append we take the startup
-             * cost as the startup cost of the first subpath.
-             */
-            apath->path.startup_cost = subpath->startup_cost;
+            Cost        first_nonasync_startup_cost = -1.0;
+            Cost        async_min_startup_cost = -1;
+            Cost        async_max_cost = 0.0;
 
             /* Compute rows and costs as sums of subplan rows and costs. */
             foreach(l, apath->subpaths)
             {
                 Path       *subpath = (Path *) lfirst(l);
 
+                /*
+                 * For an unordered, non-parallel-aware Append we take the
+                 * startup cost as the startup cost of the first
+                 * nonasync-capable subpath or the minimum startup cost of
+                 * async-capable subpaths.
+                 */
+                if (!is_async_capable_path(subpath))
+                {
+                    if (first_nonasync_startup_cost < 0.0)
+                        first_nonasync_startup_cost = subpath->startup_cost;
+
+                    apath->path.total_cost += subpath->total_cost;
+                }
+                else
+                {
+                    if (async_min_startup_cost < 0.0 ||
+                        async_min_startup_cost > subpath->startup_cost)
+                        async_min_startup_cost = subpath->startup_cost;
+
+                    /*
+                     * It's not obvious how to determine the total cost of
+                     * async subnodes. Although it is not always true, we
+                     * assume it is the maximum cost among all async subnodes.
+                     */
+                    if (async_max_cost < subpath->total_cost)
+                        async_max_cost = subpath->total_cost;
+                }
+
                 apath->path.rows += subpath->rows;
-                apath->path.total_cost += subpath->total_cost;
             }
+
+            /*
+             * If there's an sync subnodes, the startup cost is the startup
+             * cost of the first sync subnode. Otherwise it's the minimal
+             * startup cost of async subnodes.
+             */
+            if (first_nonasync_startup_cost >= 0.0)
+                apath->path.startup_cost = first_nonasync_startup_cost;
+            else
+                apath->path.startup_cost = async_min_startup_cost;
+
+            /* Use async maximum cost if it exceeds the sync total cost */
+            if (async_max_cost > apath->path.total_cost)
+                apath->path.total_cost = async_max_cost;
         }
         else
         {
@@ -2084,6 +2121,8 @@ cost_append(AppendPath *apath)
              * This case is also different from the above in that we have to
              * account for possibly injecting sorts into subpaths that aren't
              * natively ordered.
+             *
+             * Note: An ordered append won't be run asynchronously.
              */
             foreach(l, apath->subpaths)
             {
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 99278eed93..b38cb5e4ca 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1082,6 +1082,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     bool        tlist_was_changed = false;
     List       *pathkeys = best_path->path.pathkeys;
     List       *subplans = NIL;
+    List       *asyncplans = NIL;
+    List       *syncplans = NIL;
+    List       *asyncpaths = NIL;
+    List       *syncpaths = NIL;
+    List       *newsubpaths = NIL;
     ListCell   *subpaths;
     RelOptInfo *rel = best_path->path.parent;
     PartitionPruneInfo *partpruneinfo = NULL;
@@ -1090,6 +1095,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     Oid           *nodeSortOperators = NULL;
     Oid           *nodeCollations = NULL;
     bool       *nodeNullsFirst = NULL;
+    int            nasyncplans = 0;
+    bool        first = true;
+    bool        referent_is_sync = true;
 
     /*
      * The subpaths list could be empty, if every child was proven empty by
@@ -1219,9 +1227,40 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
             }
         }
 
-        subplans = lappend(subplans, subplan);
+        /*
+         * Classify as async-capable or not. If we have decided to run the
+         * children in parallel, we cannot any one of them run asynchronously.
+         * Planner thinks that all subnodes are executed in order if this
+         * append is orderd. No subpaths cannot be run asynchronously in that
+         * case.
+         */
+        if (pathkeys == NIL &&
+            !best_path->path.parallel_safe && is_async_capable_path(subpath))
+        {
+            subplan->async_capable = true;
+            asyncplans = lappend(asyncplans, subplan);
+            asyncpaths = lappend(asyncpaths, subpath);
+            ++nasyncplans;
+            if (first)
+                referent_is_sync = false;
+        }
+        else
+        {
+            syncplans = lappend(syncplans, subplan);
+            syncpaths = lappend(syncpaths, subpath);
+        }
+
+        first = false;
     }
 
+    /*
+     * subplan contains asyncplans in the first half, if any, and sync plans in
+     * another half, if any.  We need that the same for subpaths to make
+     * partition pruning information in sync with subplans.
+     */
+    subplans = list_concat(asyncplans, syncplans);
+    newsubpaths = list_concat(asyncpaths, syncpaths);
+
     /*
      * If any quals exist, they may be useful to perform further partition
      * pruning during execution.  Gather information needed by the executor to
@@ -1249,7 +1288,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
         if (prunequal != NIL)
             partpruneinfo =
                 make_partition_pruneinfo(root, rel,
-                                         best_path->subpaths,
+                                         newsubpaths,
                                          best_path->partitioned_rels,
                                          prunequal);
     }
@@ -1257,6 +1296,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     plan->appendplans = subplans;
     plan->first_partial_plan = best_path->first_partial_path;
     plan->part_prune_info = partpruneinfo;
+    plan->nasyncplans = nasyncplans;
+    plan->referent = referent_is_sync ? nasyncplans : 0;
 
     copy_generic_path_info(&plan->plan, (Path *) best_path);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9d6b3778b4..1765c56545 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3878,6 +3878,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_XACT_GROUP_UPDATE:
             event_name = "XactGroupUpdate";
             break;
+        case WAIT_EVENT_ASYNC_WAIT:
+            event_name = "AsyncExecWait";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 60dd80c23c..5680c58739 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4574,10 +4574,14 @@ set_deparse_plan(deparse_namespace *dpns, Plan *plan)
      * tlists according to one of the children, and the first one is the most
      * natural choice.  Likewise special-case ModifyTable to pretend that the
      * first child plan is the OUTER referent; this is to support RETURNING
-     * lists containing references to non-target relations.
+     * lists containing references to non-target relations. For Append, use the
+     * explicitly specified referent.
      */
     if (IsA(plan, Append))
-        dpns->outer_plan = linitial(((Append *) plan)->appendplans);
+    {
+        Append *app = (Append *) plan;
+        dpns->outer_plan = list_nth(app->appendplans, app->referent);
+    }
     else if (IsA(plan, MergeAppend))
         dpns->outer_plan = linitial(((MergeAppend *) plan)->mergeplans);
     else if (IsA(plan, ModifyTable))
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 237ca9fa30..27742a1641 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -1416,7 +1416,7 @@ void
 ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
 {
     /*
-     * XXXX: There's no property to show as an identier of a wait event set,
+     * XXXX: There's no property to show as an identifier of a wait event set,
      * use its pointer instead.
      */
     if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
@@ -1431,7 +1431,7 @@ static void
 PrintWESLeakWarning(WaitEventSet *events)
 {
     /*
-     * XXXX: There's no property to show as an identier of a wait event set,
+     * XXXX: There's no property to show as an identifier of a wait event set,
      * use its pointer instead.
      */
     elog(WARNING, "wait event set leak: %p still referenced",
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..3b6bf4a516
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,22 @@
+/*--------------------------------------------------------------------
+ * execAsync.c
+ *        Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *        src/backend/executor/execAsync.c
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                                   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+                                     long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 415e117407..9cf2c1f676 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -59,6 +59,7 @@
 #define EXEC_FLAG_MARK            0x0008    /* need mark/restore */
 #define EXEC_FLAG_SKIP_TRIGGERS 0x0010    /* skip AfterTrigger calls */
 #define EXEC_FLAG_WITH_NO_DATA    0x0020    /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC            0x0040    /* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..71a233b41f 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
                                             ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+                                          WaitEventSet *wes,
+                                          void *caller_data, bool reinit);
 
 #endif                            /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..853ba2b5ad 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -169,6 +169,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
                                                             List *fdw_private,
                                                             RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+                                                    WaitEventSet *wes,
+                                                    void *caller_data,
+                                                    bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -190,6 +195,7 @@ typedef struct FdwRoutine
     GetForeignPlan_function GetForeignPlan;
     BeginForeignScan_function BeginForeignScan;
     IterateForeignScan_function IterateForeignScan;
+    IterateForeignScan_function IterateForeignScanAsync;
     ReScanForeignScan_function ReScanForeignScan;
     EndForeignScan_function EndForeignScan;
 
@@ -242,6 +248,11 @@ typedef struct FdwRoutine
     InitializeDSMForeignScan_function InitializeDSMForeignScan;
     ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
     InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+    /* Support functions for asynchronous execution */
+    IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+    ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
     ShutdownForeignScan_function ShutdownForeignScan;
 
     /* Support functions for path reparameterization. */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index d113c271ee..177e6218cb 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -107,6 +107,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
+extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
 
 /* support for iterating through the integer elements of a set: */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0b42dd6f94..2d47a4162e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -940,6 +940,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+    AS_AVAILABLE,
+    AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
     NodeTag        type;
@@ -1028,6 +1034,11 @@ typedef struct PlanState
     bool        outeropsset;
     bool        inneropsset;
     bool        resultopsset;
+
+    /* Async subnode execution stuff */
+    AsyncState    asyncstate;
+
+    int32        padding;            /* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1223,14 +1234,21 @@ struct AppendState
     PlanState    ps;                /* its first field is NodeTag */
     PlanState **appendplans;    /* array of PlanStates for my inputs */
     int            as_nplans;
-    int            as_whichplan;
+    int            as_whichsyncplan; /* which sync plan is being executed  */
     int            as_first_partial_plan;    /* Index of 'appendplans' containing
                                          * the first partial plan */
+    int            as_nasyncplans;    /* # of async-capable children */
     ParallelAppendState *as_pstate; /* parallel coordination info */
     Size        pstate_len;        /* size of parallel coordination info */
     struct PartitionPruneState *as_prune_state;
-    Bitmapset  *as_valid_subplans;
+    Bitmapset  *as_valid_syncsubplans;
     bool        (*choose_next_subplan) (AppendState *);
+    bool        as_syncdone;    /* all synchronous plans done? */
+    Bitmapset  *as_needrequest;    /* async plans needing a new request */
+    Bitmapset  *as_pending_async;    /* pending async plans */
+    TupleTableSlot **as_asyncresult;    /* results of each async plan */
+    int            as_nasyncresult;    /* # of valid entries in as_asyncresult */
+    bool        as_exec_prune;    /* runtime pruning needed for async exec? */
 };
 
 /* ----------------
@@ -1798,6 +1816,7 @@ typedef struct ForeignScanState
     Size        pscan_len;        /* size of parallel coordination information */
     /* use struct pointer to avoid including fdwapi.h here */
     struct FdwRoutine *fdwroutine;
+    bool        fs_async;
     void       *fdw_state;        /* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 83e01074ed..abad89b327 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -135,6 +135,11 @@ typedef struct Plan
     bool        parallel_aware; /* engage parallel-aware logic? */
     bool        parallel_safe;    /* OK to use as part of parallel plan? */
 
+    /*
+     * information needed for asynchronous execution
+     */
+    bool        async_capable;  /* engage asynchronous execution logic? */
+
     /*
      * Common structural data for all Plan types.
      */
@@ -262,6 +267,10 @@ typedef struct Append
 
     /* Info for run-time subplan pruning; NULL if we're not doing that */
     struct PartitionPruneInfo *part_prune_info;
+
+    /* Async child node execution stuff */
+    int            nasyncplans;    /* # async subplans, always at start of list */
+    int            referent;         /* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 10b6e81079..53876b2d8b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -241,4 +241,6 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
 extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
                                     List *live_childrels);
 
+extern bool is_async_capable_path(Path *path);
+
 #endif                            /* PATHS_H */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..c0ea7f5aa4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -887,7 +887,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
     WAIT_EVENT_SYNC_REP,
-    WAIT_EVENT_XACT_GROUP_UPDATE
+    WAIT_EVENT_XACT_GROUP_UPDATE,
+    WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.18.4

From 74f8a34ee84ef45cffd5b5c51f67301d3b176cab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH v6 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c             |  28 +
 .../postgres_fdw/expected/postgres_fdw.out    | 272 ++++----
 contrib/postgres_fdw/postgres_fdw.c           | 601 +++++++++++++++---
 contrib/postgres_fdw/postgres_fdw.h           |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql     |  20 +-
 5 files changed, 710 insertions(+), 213 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 08daf26fdf..be5948f613 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -59,6 +59,7 @@ typedef struct ConnCacheEntry
     bool        invalidated;    /* true if reconnect is pending */
     uint32        server_hashvalue;    /* hash value of foreign server OID */
     uint32        mapping_hashvalue;    /* hash value of user mapping OID */
+    void        *storage;        /* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -203,6 +204,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
         elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
              entry->conn, server->servername, user->umid, user->userid);
+        entry->storage = NULL;
     }
 
     /*
@@ -216,6 +218,32 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
     return entry->conn;
 }
 
+/*
+ * Returns the connection specific storage for this user. Allocate with
+ * initsize if not exists.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+    bool        found;
+    ConnCacheEntry *entry;
+    ConnCacheKey key;
+
+    /* Find storage using the same key with GetConnection */
+    key = user->umid;
+    entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+    Assert(found);
+
+    /* Create one if not yet. */
+    if (entry->storage == NULL)
+    {
+        entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+        memset(entry->storage, 0, initsize);
+    }
+
+    return entry->storage;
+}
+
 /*
  * Connect to remote server using specified server and user mapping properties.
  */
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 90db550b92..9374fa3a6c 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6973,7 +6973,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -7001,7 +7001,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -7029,7 +7029,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -7057,7 +7057,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -7127,35 +7127,41 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar bar_1
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+                     Sort Key: bar_1.f1
+                     ->  Seq Scan on public.bar bar_1
+                           Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
                ->  Foreign Scan on public.bar2 bar_2
                      Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -7165,35 +7171,41 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar bar_1
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+                     Sort Key: bar_1.f1
+                     ->  Seq Scan on public.bar bar_1
+                           Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
                ->  Foreign Scan on public.bar2 bar_2
                      Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -7223,11 +7235,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
    ->  Hash Join
          Output: bar_1.f1, (bar_1.f2 + 100), bar_1.f3, bar_1.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -7241,12 +7254,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(41 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
 select tableoid::regclass, * from bar order by 1,2;
@@ -7276,16 +7290,17 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
+               Async subplans: 2 
+               ->  Async Foreign Scan on public.foo2 foo_1
+                     Output: ROW(foo_1.f1), foo_1.f1
+                     Remote SQL: SELECT f1 FROM public.loct1
+               ->  Async Foreign Scan on public.foo2 foo_3
+                     Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+                     Remote SQL: SELECT f1 FROM public.loct1
                ->  Seq Scan on public.foo
                      Output: ROW(foo.f1), foo.f1
-               ->  Foreign Scan on public.foo2 foo_1
-                     Output: ROW(foo_1.f1), foo_1.f1
-                     Remote SQL: SELECT f1 FROM public.loct1
                ->  Seq Scan on public.foo foo_2
                      Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
-               ->  Foreign Scan on public.foo2 foo_3
-                     Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
-                     Remote SQL: SELECT f1 FROM public.loct1
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -7303,17 +7318,18 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on public.foo2 foo_1
+                           Output: ROW(foo_1.f1), foo_1.f1
+                           Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Async Foreign Scan on public.foo2 foo_3
+                           Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+                           Remote SQL: SELECT f1 FROM public.loct1
                      ->  Seq Scan on public.foo
                            Output: ROW(foo.f1), foo.f1
-                     ->  Foreign Scan on public.foo2 foo_1
-                           Output: ROW(foo_1.f1), foo_1.f1
-                           Remote SQL: SELECT f1 FROM public.loct1
                      ->  Seq Scan on public.foo foo_2
                            Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
-                     ->  Foreign Scan on public.foo2 foo_3
-                           Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
-                           Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+(47 rows)
 
 update bar set f2 = f2 + 100
 from
@@ -7463,27 +7479,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2 bar_1
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2 bar_1
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2 bar_1
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2 bar_1
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
@@ -8558,11 +8580,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
  Sort
    Sort Key: t1.a, t3.c
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE
t1.a% 25 =0 ORDER BY 1,2,3;
 
   a  |  b  |  c   
@@ -8597,20 +8620,22 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
 -- with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2
t2WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
 
-                       QUERY PLAN                       
---------------------------------------------------------
+                          QUERY PLAN                          
+--------------------------------------------------------------
  Sort
    Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
    ->  Hash Full Join
          Hash Cond: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
-               ->  Foreign Scan on ftprt1_p2 t1_2
+               Async subplans: 2 
+               ->  Async Foreign Scan on ftprt1_p1 t1_1
+               ->  Async Foreign Scan on ftprt1_p2 t1_2
          ->  Hash
                ->  Append
-                     ->  Foreign Scan on ftprt2_p1 t2_1
-                     ->  Foreign Scan on ftprt2_p2 t2_2
-(11 rows)
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on ftprt2_p1 t2_1
+                     ->  Async Foreign Scan on ftprt2_p2 t2_2
+(13 rows)
 
 SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2
t2WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
 
        wr       |       wr       
@@ -8639,11 +8664,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
  Sort
    Sort Key: t1.a, t1.b
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE
t1.a%25= 0 ORDER BY 1,2;
 
   a  |  b  
@@ -8696,21 +8722,23 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
 -- test FOR UPDATE; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF
t1;
-                          QUERY PLAN                          
---------------------------------------------------------------
+                             QUERY PLAN                             
+--------------------------------------------------------------------
  LockRows
    ->  Sort
          Sort Key: t1.a
          ->  Hash Join
                Hash Cond: (t2.b = t1.a)
                ->  Append
-                     ->  Foreign Scan on ftprt2_p1 t2_1
-                     ->  Foreign Scan on ftprt2_p2 t2_2
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on ftprt2_p1 t2_1
+                     ->  Async Foreign Scan on ftprt2_p2 t2_2
                ->  Hash
                      ->  Append
-                           ->  Foreign Scan on ftprt1_p1 t1_1
-                           ->  Foreign Scan on ftprt1_p2 t1_2
-(12 rows)
+                           Async subplans: 2 
+                           ->  Async Foreign Scan on ftprt1_p1 t1_1
+                           ->  Async Foreign Scan on ftprt1_p2 t1_2
+(14 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF
t1;
   a  |  b  
@@ -8745,18 +8773,19 @@ ANALYZE fpagg_tab_p3;
 SET enable_partitionwise_aggregate TO false;
 EXPLAIN (COSTS OFF)
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
-                        QUERY PLAN                         
------------------------------------------------------------
+                           QUERY PLAN                            
+-----------------------------------------------------------------
  Sort
    Sort Key: pagg_tab.a
    ->  HashAggregate
          Group Key: pagg_tab.a
          Filter: (avg(pagg_tab.b) < '22'::numeric)
          ->  Append
-               ->  Foreign Scan on fpagg_tab_p1 pagg_tab_1
-               ->  Foreign Scan on fpagg_tab_p2 pagg_tab_2
-               ->  Foreign Scan on fpagg_tab_p3 pagg_tab_3
-(9 rows)
+               Async subplans: 3 
+               ->  Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+               ->  Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+               ->  Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
 
 -- Plan with partitionwise aggregates is enabled
 SET enable_partitionwise_aggregate TO true;
@@ -8767,13 +8796,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
  Sort
    Sort Key: pagg_tab.a
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 3 
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
-(9 rows)
+(10 rows)
 
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | sum  | min | count 
@@ -8795,29 +8825,22 @@ SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  Sort
    Output: t1.a, (count(((t1.*)::pagg_tab)))
    Sort Key: t1.a
-   ->  Append
-         ->  HashAggregate
-               Output: t1.a, count(((t1.*)::pagg_tab))
-               Group Key: t1.a
-               Filter: (avg(t1.b) < '22'::numeric)
-               ->  Foreign Scan on public.fpagg_tab_p1 t1
-                     Output: t1.a, t1.*, t1.b
-                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
-         ->  HashAggregate
-               Output: t1_1.a, count(((t1_1.*)::pagg_tab))
-               Group Key: t1_1.a
-               Filter: (avg(t1_1.b) < '22'::numeric)
-               ->  Foreign Scan on public.fpagg_tab_p2 t1_1
+   ->  HashAggregate
+         Output: t1.a, count(((t1.*)::pagg_tab))
+         Group Key: t1.a
+         Filter: (avg(t1.b) < '22'::numeric)
+         ->  Append
+               Async subplans: 3 
+               ->  Async Foreign Scan on public.fpagg_tab_p1 t1_1
                      Output: t1_1.a, t1_1.*, t1_1.b
-                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
-         ->  HashAggregate
-               Output: t1_2.a, count(((t1_2.*)::pagg_tab))
-               Group Key: t1_2.a
-               Filter: (avg(t1_2.b) < '22'::numeric)
-               ->  Foreign Scan on public.fpagg_tab_p3 t1_2
+                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
+               ->  Async Foreign Scan on public.fpagg_tab_p2 t1_2
                      Output: t1_2.a, t1_2.*, t1_2.b
+                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
+               ->  Async Foreign Scan on public.fpagg_tab_p3 t1_3
+                     Output: t1_3.a, t1_3.*, t1_3.b
                      Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
-(25 rows)
+(18 rows)
 
 SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | count 
@@ -8837,20 +8860,15 @@ SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700
 -----------------------------------------------------------------
  Sort
    Sort Key: pagg_tab.b
-   ->  Finalize HashAggregate
+   ->  HashAggregate
          Group Key: pagg_tab.b
          Filter: (sum(pagg_tab.a) < 700)
          ->  Append
-               ->  Partial HashAggregate
-                     Group Key: pagg_tab.b
-                     ->  Foreign Scan on fpagg_tab_p1 pagg_tab
-               ->  Partial HashAggregate
-                     Group Key: pagg_tab_1.b
-                     ->  Foreign Scan on fpagg_tab_p2 pagg_tab_1
-               ->  Partial HashAggregate
-                     Group Key: pagg_tab_2.b
-                     ->  Foreign Scan on fpagg_tab_p3 pagg_tab_2
-(15 rows)
+               Async subplans: 3 
+               ->  Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+               ->  Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+               ->  Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
 
 -- ===================================================================
 -- access rights and superuser
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 9fc53cad68..4bfc2d39ea 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -35,6 +37,7 @@
 #include "optimizer/restrictinfo.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "postgres_fdw.h"
 #include "utils/builtins.h"
 #include "utils/float.h"
@@ -56,6 +59,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -122,11 +128,29 @@ enum FdwDirectModifyPrivateIndex
     FdwDirectModifyPrivateSetProcessed
 };
 
+/*
+ * Connection common state - shared among all PgFdwState instances using the
+ * same connection.
+ */
+typedef struct PgFdwConnCommonState
+{
+    ForeignScanState   *leader;        /* leader node of this connection */
+    bool                busy;        /* true if this connection is busy */
+} PgFdwConnCommonState;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+    PGconn       *conn;                    /* connection for the scan */
+    PgFdwConnCommonState *commonstate;    /* connection common state */
+} PgFdwState;
+
 /*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table. NULL
                                  * for a foreign join scan. */
     TupleDesc    tupdesc;        /* tuple descriptor of scan */
@@ -137,7 +161,6 @@ typedef struct PgFdwScanState
     List       *retrieved_attrs;    /* list of retrieved attribute numbers */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     unsigned int cursor_number; /* quasi-unique ID for my cursor */
     bool        cursor_exists;    /* have we created the cursor? */
     int            numParams;        /* number of parameters passed to query */
@@ -153,6 +176,12 @@ typedef struct PgFdwScanState
     /* batch-level state, for optimizing rewinds and avoiding useless fetch */
     int            fetch_ct_2;        /* Min(# of fetches done, 2) */
     bool        eof_reached;    /* true if last fetch reached EOF */
+    bool        async;            /* true if run asynchronously */
+    bool        queued;            /* true if this node is in waiter queue */
+    ForeignScanState *waiter;    /* Next node to run a query among nodes
+                                 * sharing the same connection */
+    ForeignScanState *last_waiter;    /* last element in waiter queue.
+                                     * valid only on the leader node */
 
     /* working memory contexts */
     MemoryContext batch_cxt;    /* context holding current batch of tuples */
@@ -166,11 +195,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     char       *p_name;            /* name of prepared statement, if created */
 
     /* extracted fdw_private data */
@@ -197,6 +226,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
@@ -326,6 +356,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
                                             RangeTblEntry *target_rte,
                                             Relation target_relation);
@@ -391,6 +422,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
                                          RelOptInfo *input_rel,
                                          RelOptInfo *output_rel,
                                          void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+                                              WaitEventSet *wes,
+                                              void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -419,7 +454,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
                                       EquivalenceClass *ec, EquivalenceMember *em,
                                       void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static PgFdwModifyState *create_foreign_modify(EState *estate,
                                                RangeTblEntry *rte,
@@ -522,6 +559,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     routine->IterateForeignScan = postgresIterateForeignScan;
     routine->ReScanForeignScan = postgresReScanForeignScan;
     routine->EndForeignScan = postgresEndForeignScan;
+    routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
     /* Functions for updating foreign tables */
     routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -558,6 +596,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     /* Support functions for upper relation push-down */
     routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+    /* Support functions for async execution */
+    routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+    routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
     PG_RETURN_POINTER(routine);
 }
 
@@ -1434,12 +1476,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    fsstate->conn = GetConnection(user, false);
+    fsstate->s.conn = GetConnection(user, false);
+    fsstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
+    fsstate->s.commonstate->leader = NULL;
+    fsstate->s.commonstate->busy = false;
+    fsstate->waiter = NULL;
+    fsstate->last_waiter = node;
 
     /* Assign a unique ID for my cursor */
-    fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+    fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
     fsstate->cursor_exists = false;
 
+    /* Initialize async execution status */
+    fsstate->async = false;
+    fsstate->queued = false;
+
     /* Get private info created by planner functions. */
     fsstate->query = strVal(list_nth(fsplan->fdw_private,
                                      FdwScanPrivateSelectSql));
@@ -1487,40 +1539,241 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
                              &fsstate->param_values);
 }
 
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Enqueue node if it isn't in the queue. Immediately send request it if the
+ * underlying connection is not busy.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+    PgFdwScanState   *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.commonstate->leader;
+
+    /*
+     * Do nothing if the node is already in the queue or already eof'ed.
+     * Note: leader node is not marked as queued.
+     */
+    if (leader == node || fsstate->queued || fsstate->eof_reached)
+        return;
+
+    if (leader == NULL)
+    {
+        /* no leader means not busy, send request immediately */
+        request_more_data(node);
+    }
+    else
+    {
+        /* the connection is busy, queue the node */
+        PgFdwScanState   *leader_state = GetPgFdwScanState(leader);
+        PgFdwScanState   *last_waiter_state
+            = GetPgFdwScanState(leader_state->last_waiter);
+
+        last_waiter_state->waiter = node;
+        leader_state->last_waiter = node;
+        fsstate->queued = true;
+    }
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Make the first waiter be the next leader
+ * Returns the new leader or NULL if there's no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+    PgFdwScanState *leader_state = GetPgFdwScanState(node);
+    ForeignScanState *next_leader = leader_state->waiter;
+
+    Assert(leader_state->s.commonstate->leader = node);
+
+    if (next_leader)
+    {
+        /* the first waiter becomes the next leader */
+        PgFdwScanState *next_leader_state = GetPgFdwScanState(next_leader);
+        next_leader_state->last_waiter = leader_state->last_waiter;
+        next_leader_state->queued = false;
+    }
+
+    leader_state->waiter = NULL;
+    leader_state->s.commonstate->leader = next_leader;
+
+    return next_leader;
+}
+
+/*
+ * Remove the node from waiter queue.
+ *
+ * Remaining results are cleared if the node is a busy leader.
+ * This intended to be used during node shutdown.
+ */
+static inline void
+remove_async_node(ForeignScanState *node)
+{
+    PgFdwScanState        *fsstate = GetPgFdwScanState(node);
+    ForeignScanState    *leader = fsstate->s.commonstate->leader;
+    PgFdwScanState        *leader_state;
+    ForeignScanState    *prev;
+    PgFdwScanState        *prev_state;
+    ForeignScanState    *cur;
+
+    /* no need to remove me */
+    if (!leader || !fsstate->queued)
+        return;
+
+    leader_state = GetPgFdwScanState(leader);
+
+    if (leader == node)
+    {
+        if (leader_state->s.commonstate->busy)
+        {
+            /*
+             * this node is waiting for result, absorb the result first so
+             * that the following commands can be sent on the connection.
+             */
+            PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+            PGconn *conn = leader_state->s.conn;
+
+            while(PQisBusy(conn))
+                PQclear(PQgetResult(conn));
+
+            leader_state->s.commonstate->busy = false;
+        }
+
+        move_to_next_waiter(node);
+
+        return;
+    }
+
+    /*
+     * Just remove the node from the queue
+     *
+     * Nodes don't have a link to the previous node but anyway this function is
+     * called on the shutdown path, so we don't bother seeking for faster way
+     * to do this.
+     */
+    prev = leader;
+    prev_state = leader_state;
+    cur =  GetPgFdwScanState(prev)->waiter;
+    while (cur)
+    {
+        PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+        if (cur == node)
+        {
+            prev_state->waiter = curstate->waiter;
+
+            /* relink to the previous node if the last node was removed */
+            if (leader_state->last_waiter == cur)
+                leader_state->last_waiter = prev;
+
+            fsstate->queued = false;
+
+            return;
+        }
+        prev = cur;
+        prev_state = curstate;
+        cur = curstate->waiter;
+    }
+}
+
 /*
  * postgresIterateForeignScan
- *        Retrieve next row from the result set, or clear tuple slot to indicate
- *        EOF.
+ *        Retrieve next row from the result set.
+ *
+ *        For synchronous nodes, returns clear tuple slot means EOF.
+ *
+ *        For asynchronous nodes, if clear tuple slot is returned, the caller
+ *        needs to check async state to tell if all tuples received
+ *        (AS_AVAILABLE) or waiting for the next data to come (AS_WAITING).
  */
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
-    /*
-     * If this is the first call after Begin or ReScan, we need to create the
-     * cursor on the remote side.
-     */
-    if (!fsstate->cursor_exists)
-        create_cursor(node);
-
-    /*
-     * Get some more tuples, if we've run out.
-     */
+    if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+    {
+        /* we've run out, get some more tuples */
+        if (!node->fs_async)
+        {
+            /*
+             * finish the running query before sending the next command for
+             * this node
+             */
+            if (!fsstate->s.commonstate->busy)
+                vacate_connection((PgFdwState *)fsstate, false);
+
+            request_more_data(node);
+
+            /* Fetch the result immediately. */
+            fetch_received_data(node);
+        }
+        else if (!fsstate->s.commonstate->busy)
+        {
+            /* If the connection is not busy, just send the request. */
+            request_more_data(node);
+        }
+        else
+        {
+            /* The connection is busy, queue the request */
+            bool available = true;
+            ForeignScanState *leader = fsstate->s.commonstate->leader;
+            PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+
+            /* queue the requested node */
+            add_async_waiter(node);
+
+            /*
+             * The request for the next node cannot be sent before the leader
+             * responds. Finish the current leader if possible.
+             */
+            if (PQisBusy(leader_state->s.conn))
+            {
+                int rc = WaitLatchOrSocket(NULL,
+                                           WL_SOCKET_READABLE | WL_TIMEOUT |
+                                           WL_EXIT_ON_PM_DEATH,
+                                           PQsocket(leader_state->s.conn), 0,
+                                           WAIT_EVENT_ASYNC_WAIT);
+                if (!(rc & WL_SOCKET_READABLE))
+                    available = false;
+            }
+
+            /* fetch the leader's data and enqueue it for the next request */
+            if (available)
+            {
+                fetch_received_data(leader);
+                add_async_waiter(leader);
+            }
+        }
+    }
+
     if (fsstate->next_tuple >= fsstate->num_tuples)
     {
-        /* No point in another fetch if we already detected EOF, though. */
-        if (!fsstate->eof_reached)
-            fetch_more_data(node);
-        /* If we didn't get any tuples, must be end of data. */
-        if (fsstate->next_tuple >= fsstate->num_tuples)
-            return ExecClearTuple(slot);
+        /*
+         * We haven't received a result for the given node this time, return
+         * with no tuple to give way to another node.
+         */
+        if (fsstate->eof_reached)
+            node->ss.ps.asyncstate = AS_AVAILABLE;
+        else
+            node->ss.ps.asyncstate = AS_WAITING;
+
+        return ExecClearTuple(slot);
     }
 
     /*
      * Return the next tuple.
      */
+    node->ss.ps.asyncstate = AS_AVAILABLE;
     ExecStoreHeapTuple(fsstate->tuples[fsstate->next_tuple++],
                        slot,
                        false);
@@ -1535,7 +1788,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     char        sql[64];
     PGresult   *res;
 
@@ -1543,6 +1796,8 @@ postgresReScanForeignScan(ForeignScanState *node)
     if (!fsstate->cursor_exists)
         return;
 
+    vacate_connection((PgFdwState *)fsstate, true);
+
     /*
      * If any internal parameters affecting this node have changed, we'd
      * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1571,9 +1826,9 @@ postgresReScanForeignScan(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_exec_query(fsstate->conn, sql);
+    res = pgfdw_exec_query(fsstate->s.conn, sql);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+        pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
     PQclear(res);
 
     /* Now force a fresh FETCH. */
@@ -1591,7 +1846,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
     /* if fsstate is NULL, we are in EXPLAIN; nothing to do */
     if (fsstate == NULL)
@@ -1599,15 +1854,31 @@ postgresEndForeignScan(ForeignScanState *node)
 
     /* Close the cursor if open, to prevent accumulation of cursors */
     if (fsstate->cursor_exists)
-        close_cursor(fsstate->conn, fsstate->cursor_number);
+        close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
     /* Release remote connection */
-    ReleaseConnection(fsstate->conn);
-    fsstate->conn = NULL;
+    ReleaseConnection(fsstate->s.conn);
+    fsstate->s.conn = NULL;
 
     /* MemoryContexts will be deleted automatically. */
 }
 
+/*
+ * postgresShutdownForeignScan
+ *        Remove asynchrony stuff and cleanup garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+    ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+    if (plan->operation != CMD_SELECT)
+        return;
+
+    /* remove the node from waiting queue */
+    remove_async_node(node);
+}
+
 /*
  * postgresAddForeignUpdateTargets
  *        Add resjunk column(s) needed for update/delete on a foreign table
@@ -2372,7 +2643,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    dmstate->conn = GetConnection(user, false);
+    dmstate->s.conn = GetConnection(user, false);
+    dmstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
 
     /* Update the foreign-join-related fields. */
     if (fsplan->scan.scanrelid == 0)
@@ -2457,7 +2730,11 @@ postgresIterateDirectModify(ForeignScanState *node)
      * If this is the first call after Begin, execute the statement.
      */
     if (dmstate->num_tuples == -1)
+    {
+        /* finish running query to send my command */
+        vacate_connection((PgFdwState *)dmstate, true);
         execute_dml_stmt(node);
+    }
 
     /*
      * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2504,8 +2781,8 @@ postgresEndDirectModify(ForeignScanState *node)
         PQclear(dmstate->result);
 
     /* Release remote connection */
-    ReleaseConnection(dmstate->conn);
-    dmstate->conn = NULL;
+    ReleaseConnection(dmstate->s.conn);
+    dmstate->s.conn = NULL;
 
     /* MemoryContext will be deleted automatically. */
 }
@@ -2703,6 +2980,7 @@ estimate_path_cost_size(PlannerInfo *root,
         List       *local_param_join_conds;
         StringInfoData sql;
         PGconn       *conn;
+        PgFdwConnCommonState *commonstate;
         Selectivity local_sel;
         QualCost    local_cost;
         List       *fdw_scan_tlist = NIL;
@@ -2747,6 +3025,18 @@ estimate_path_cost_size(PlannerInfo *root,
 
         /* Get the remote estimate */
         conn = GetConnection(fpinfo->user, false);
+        commonstate = GetConnectionSpecificStorage(fpinfo->user,
+                                                sizeof(PgFdwConnCommonState));
+        if (commonstate)
+        {
+            PgFdwState tmpstate;
+            tmpstate.conn = conn;
+            tmpstate.commonstate = commonstate;
+
+            /* finish running query to send my command */
+            vacate_connection(&tmpstate, true);
+        }
+
         get_remote_estimate(sql.data, conn, &rows, &width,
                             &startup_cost, &total_cost);
         ReleaseConnection(conn);
@@ -3317,11 +3607,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     ExprContext *econtext = node->ss.ps.ps_ExprContext;
     int            numParams = fsstate->numParams;
     const char **values = fsstate->param_values;
-    PGconn       *conn = fsstate->conn;
+    PGconn       *conn = fsstate->s.conn;
     StringInfoData buf;
     PGresult   *res;
 
@@ -3384,50 +3674,119 @@ create_cursor(ForeignScanState *node)
 }
 
 /*
- * Fetch some more rows from the node's cursor.
+ * Sends the next request of the node. If the given node is different from the
+ * current connection leader, pushes it back to waiter queue and let the given
+ * node be the leader.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.commonstate->leader;
+    PGconn       *conn = fsstate->s.conn;
+    char        sql[64];
+
+    /* must be non-busy */
+    Assert(!fsstate->s.commonstate->busy);
+    /* must be not-eof'ed */
+    Assert(!fsstate->eof_reached);
+
+    /*
+     * If this is the first call after Begin or ReScan, we need to create the
+     * cursor on the remote side.
+     */
+    if (!fsstate->cursor_exists)
+        create_cursor(node);
+
+    snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+             fsstate->fetch_size, fsstate->cursor_number);
+
+    if (!PQsendQuery(conn, sql))
+        pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+    fsstate->s.commonstate->busy = true;
+
+    /* The node is the current leader, just return. */
+    if (leader == node)
+        return;
+
+    /* Let the node be the leader */
+    if (leader != NULL)
+    {
+        remove_async_node(node);
+        fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+        fsstate->waiter = leader;
+    }
+    else
+    {
+        fsstate->last_waiter = node;
+        fsstate->waiter = NULL;
+    }
+
+    fsstate->s.commonstate->leader = node;
+}
+
+/*
+ * Fetches received data and automatically send requests of the next waiter.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     PGresult   *volatile res = NULL;
     MemoryContext oldcontext;
+    ForeignScanState *waiter;
+
+    /* I should be the current connection leader */
+    Assert(fsstate->s.commonstate->leader == node);
 
     /*
      * We'll store the tuples in the batch_cxt.  First, flush the previous
-     * batch.
+     * batch if no tuple is remaining
      */
-    fsstate->tuples = NULL;
-    MemoryContextReset(fsstate->batch_cxt);
+    if (fsstate->next_tuple >= fsstate->num_tuples)
+    {
+        fsstate->tuples = NULL;
+        fsstate->num_tuples = 0;
+        MemoryContextReset(fsstate->batch_cxt);
+    }
+    else if (fsstate->next_tuple > 0)
+    {
+        /* There's some remains. Move them to the beginning of the store */
+        int n = 0;
+
+        while(fsstate->next_tuple < fsstate->num_tuples)
+            fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+        fsstate->num_tuples = n;
+    }
+
     oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
     /* PGresult must be released before leaving this function. */
     PG_TRY();
     {
-        PGconn       *conn = fsstate->conn;
-        char        sql[64];
-        int            numrows;
+        PGconn       *conn = fsstate->s.conn;
+        int            addrows;
+        size_t        newsize;
         int            i;
 
-        snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
-                 fsstate->fetch_size, fsstate->cursor_number);
-
-        res = pgfdw_exec_query(conn, sql);
-        /* On error, report the original query, not the FETCH. */
+        res = pgfdw_get_result(conn, fsstate->query);
         if (PQresultStatus(res) != PGRES_TUPLES_OK)
             pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
         /* Convert the data into HeapTuples */
-        numrows = PQntuples(res);
-        fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-        fsstate->num_tuples = numrows;
-        fsstate->next_tuple = 0;
+        addrows = PQntuples(res);
+        newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+        if (fsstate->tuples)
+            fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+        else
+            fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-        for (i = 0; i < numrows; i++)
+        for (i = 0; i < addrows; i++)
         {
             Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-            fsstate->tuples[i] =
+            fsstate->tuples[fsstate->num_tuples + i] =
                 make_tuple_from_result_row(res, i,
                                            fsstate->rel,
                                            fsstate->attinmeta,
@@ -3437,22 +3796,73 @@ fetch_more_data(ForeignScanState *node)
         }
 
         /* Update fetch_ct_2 */
-        if (fsstate->fetch_ct_2 < 2)
+        if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
             fsstate->fetch_ct_2++;
 
+        fsstate->next_tuple = 0;
+        fsstate->num_tuples += addrows;
+
         /* Must be EOF if we didn't get as many tuples as we asked for. */
-        fsstate->eof_reached = (numrows < fsstate->fetch_size);
+        fsstate->eof_reached = (addrows < fsstate->fetch_size);
     }
     PG_FINALLY();
     {
+        fsstate->s.commonstate->busy = false;
+
         if (res)
             PQclear(res);
     }
     PG_END_TRY();
 
+    /* let the first waiter be the next leader of this connection */
+    waiter = move_to_next_waiter(node);
+
+    /* send the next request if any */
+    if (waiter)
+        request_more_data(waiter);
+
     MemoryContextSwitchTo(oldcontext);
 }
 
+/*
+ * Vacate the underlying connection so that this node can send the next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+    PgFdwConnCommonState *commonstate = fdwstate->commonstate;
+    ForeignScanState *leader;
+
+    Assert(commonstate != NULL);
+
+    /* just return if the connection is already available */
+    if (commonstate->leader == NULL || !commonstate->busy)
+        return;
+
+    /*
+     * let the current connection leader read all of the result for the running
+     * query
+     */
+    leader = commonstate->leader;
+    fetch_received_data(leader);
+
+    /* let the first waiter be the next leader of this connection */
+    move_to_next_waiter(leader);
+
+    if (!clear_queue)
+        return;
+
+    /* Clear the waiting list */
+    while (leader)
+    {
+        PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+        fsstate->last_waiter = NULL;
+        leader = fsstate->waiter;
+        fsstate->waiter = NULL;
+    }
+}
+
 /*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
@@ -3566,7 +3976,9 @@ create_foreign_modify(EState *estate,
     user = GetUserMapping(userid, table->serverid);
 
     /* Open connection; report that we'll create a prepared statement. */
-    fmstate->conn = GetConnection(user, true);
+    fmstate->s.conn = GetConnection(user, true);
+    fmstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
     fmstate->p_name = NULL;        /* prepared statement not made yet */
 
     /* Set up remote query information. */
@@ -3653,6 +4065,9 @@ execute_foreign_modify(EState *estate,
            operation == CMD_UPDATE ||
            operation == CMD_DELETE);
 
+    /* finish running query to send my command */
+    vacate_connection((PgFdwState *)fmstate, true);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -3680,14 +4095,14 @@ execute_foreign_modify(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3695,10 +4110,10 @@ execute_foreign_modify(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -3734,7 +4149,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
     /* Construct name we'll use for the prepared statement. */
     snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-             GetPrepStmtNumber(fmstate->conn));
+             GetPrepStmtNumber(fmstate->s.conn));
     p_name = pstrdup(prep_name);
 
     /*
@@ -3744,12 +4159,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * the prepared statements we use in this module are simple enough that
      * the remote server will make the right choices.
      */
-    if (!PQsendPrepare(fmstate->conn,
+    if (!PQsendPrepare(fmstate->s.conn,
                        p_name,
                        fmstate->query,
                        0,
                        NULL))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3757,9 +4172,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
     PQclear(res);
 
     /* This action shows that the prepare has been done. */
@@ -3888,16 +4303,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
          * We don't use a PG_TRY block here, so be careful not to throw error
          * without releasing the PGresult.
          */
-        res = pgfdw_exec_query(fmstate->conn, sql);
+        res = pgfdw_exec_query(fmstate->s.conn, sql);
         if (PQresultStatus(res) != PGRES_COMMAND_OK)
-            pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+            pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
         PQclear(res);
         fmstate->p_name = NULL;
     }
 
     /* Release remote connection */
-    ReleaseConnection(fmstate->conn);
-    fmstate->conn = NULL;
+    ReleaseConnection(fmstate->s.conn);
+    fmstate->s.conn = NULL;
 }
 
 /*
@@ -4056,9 +4471,9 @@ execute_dml_stmt(ForeignScanState *node)
      * the desired result.  This allows us to avoid assuming that the remote
      * server has the same OIDs we do for the parameters' types.
      */
-    if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+    if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
                            NULL, values, NULL, NULL, 0))
-        pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+        pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -4066,10 +4481,10 @@ execute_dml_stmt(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+    dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
     if (PQresultStatus(dmstate->result) !=
         (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+        pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
                            dmstate->query);
 
     /* Get the number of rows affected. */
@@ -5560,6 +5975,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
     /* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+    return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add wait event so that the ForeignScan node is going to wait for.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                                  void *caller_data, bool reinit)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+    /* Reinit is not supported for now. */
+    Assert(reinit);
+
+    if (fsstate->s.commonstate->leader == node)
+    {
+        AddWaitEventToSet(wes,
+                          WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+                          NULL, caller_data);
+        return true;
+    }
+
+    return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..96af75a33e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -85,6 +85,7 @@ typedef struct PgFdwRelationInfo
     UserMapping *user;            /* only set in use_remote_estimate mode */
 
     int            fetch_size;        /* fetch size for this remote table */
+    bool        allow_prefetch;    /* true to allow overlapped fetching  */
 
     /*
      * Name of the relation, for use while EXPLAINing ForeignScan.  It is used
@@ -130,6 +131,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 83971665e3..359208a12a 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1780,25 +1780,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1840,12 +1840,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1904,8 +1904,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.18.4


Re: Asynchronous Append on postgres_fdw nodes.

From
Konstantin Knizhnik
Date:

On 20.08.2020 10:36, Kyotaro Horiguchi wrote:
> At Wed, 19 Aug 2020 23:25:36 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in
>> On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote:
>>> As the result of a discussion with Fujita-san off-list, I'm going to
>>> hold off development until he decides whether mine or Thomas' is
>>> better.
>> The latest patch doesn't apply so I set as WoA.
>> https://commitfest.postgresql.org/29/2491/
> Thanks. This is rebased version.
>
> At Fri, 14 Aug 2020 13:29:16 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
>> Either way, we definitely need patch 0001.  One comment:
>>
>> -CreateWaitEventSet(MemoryContext context, int nevents)
>> +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
>>
>> I wonder if it's better to have it receive ResourceOwner like that, or
>> to have it capture CurrentResourceOwner.  I think the latter is more
>> common in existing code.
> There's no existing WaitEventSets belonging to a resowner. So
> unconditionally capturing CurrentResourceOwner doesn't work well. I
> could pass a bool instead but that make things more complex.
>
> Come to think of "complex", ExecAsync stuff in this patch might be
> too-much for a short-term solution until executor overhaul, if it
> comes shortly. (the patch of mine here as a whole is like that,
> though..). The queueing stuff in postgres_fdw is, too.
>
> regards.
>


Hi,
Looks like current implementation of asynchronous append incorrectly 
handle LIMIT clause:

psql:append.sql:10: ERROR:  another command is already in progress
CONTEXT:  remote SQL command: CLOSE c1



-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Konstantin Knizhnik
Date:

On 22.09.2020 15:52, Konstantin Knizhnik wrote:
>
>
> On 20.08.2020 10:36, Kyotaro Horiguchi wrote:
>> At Wed, 19 Aug 2020 23:25:36 -0500, Justin Pryzby 
>> <pryzby@telsasoft.com> wrote in
>>> On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote:
>>>> As the result of a discussion with Fujita-san off-list, I'm going to
>>>> hold off development until he decides whether mine or Thomas' is
>>>> better.
>>> The latest patch doesn't apply so I set as WoA.
>>> https://commitfest.postgresql.org/29/2491/
>> Thanks. This is rebased version.
>>
>> At Fri, 14 Aug 2020 13:29:16 +1200, Thomas Munro 
>> <thomas.munro@gmail.com> wrote in
>>> Either way, we definitely need patch 0001.  One comment:
>>>
>>> -CreateWaitEventSet(MemoryContext context, int nevents)
>>> +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int 
>>> nevents)
>>>
>>> I wonder if it's better to have it receive ResourceOwner like that, or
>>> to have it capture CurrentResourceOwner.  I think the latter is more
>>> common in existing code.
>> There's no existing WaitEventSets belonging to a resowner. So
>> unconditionally capturing CurrentResourceOwner doesn't work well. I
>> could pass a bool instead but that make things more complex.
>>
>> Come to think of "complex", ExecAsync stuff in this patch might be
>> too-much for a short-term solution until executor overhaul, if it
>> comes shortly. (the patch of mine here as a whole is like that,
>> though..). The queueing stuff in postgres_fdw is, too.
>>
>> regards.
>>
>
>
> Hi,
> Looks like current implementation of asynchronous append incorrectly 
> handle LIMIT clause:
>
> psql:append.sql:10: ERROR:  another command is already in progress
> CONTEXT:  remote SQL command: CLOSE c1
>
>
>
Just FYI: the following patch fixes the problem:

--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -1667,6 +1667,11 @@ remove_async_node(ForeignScanState *node)

          if (cur == node)
          {
+            PGconn *conn = curstate->s.conn;
+
+            while(PQisBusy(conn))
+                PQclear(PQgetResult(conn));
+
              prev_state->waiter = curstate->waiter;

              /* relink to the previous node if the last node was removed */

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Konstantin Knizhnik
Date:

On 22.09.2020 16:40, Konstantin Knizhnik wrote:
>
>
> On 22.09.2020 15:52, Konstantin Knizhnik wrote:
>>
>>
>> On 20.08.2020 10:36, Kyotaro Horiguchi wrote:
>>> At Wed, 19 Aug 2020 23:25:36 -0500, Justin Pryzby 
>>> <pryzby@telsasoft.com> wrote in
>>>> On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote:
>>>>> As the result of a discussion with Fujita-san off-list, I'm going to
>>>>> hold off development until he decides whether mine or Thomas' is
>>>>> better.
>>>> The latest patch doesn't apply so I set as WoA.
>>>> https://commitfest.postgresql.org/29/2491/
>>> Thanks. This is rebased version.
>>>
>>> At Fri, 14 Aug 2020 13:29:16 +1200, Thomas Munro 
>>> <thomas.munro@gmail.com> wrote in
>>>> Either way, we definitely need patch 0001.  One comment:
>>>>
>>>> -CreateWaitEventSet(MemoryContext context, int nevents)
>>>> +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int 
>>>> nevents)
>>>>
>>>> I wonder if it's better to have it receive ResourceOwner like that, or
>>>> to have it capture CurrentResourceOwner.  I think the latter is more
>>>> common in existing code.
>>> There's no existing WaitEventSets belonging to a resowner. So
>>> unconditionally capturing CurrentResourceOwner doesn't work well. I
>>> could pass a bool instead but that make things more complex.
>>>
>>> Come to think of "complex", ExecAsync stuff in this patch might be
>>> too-much for a short-term solution until executor overhaul, if it
>>> comes shortly. (the patch of mine here as a whole is like that,
>>> though..). The queueing stuff in postgres_fdw is, too.
>>>
>>> regards.
>>>
>>
>>
>> Hi,
>> Looks like current implementation of asynchronous append incorrectly 
>> handle LIMIT clause:
>>
>> psql:append.sql:10: ERROR:  another command is already in progress
>> CONTEXT:  remote SQL command: CLOSE c1
>>
>>
>>
> Just FYI: the following patch fixes the problem:
>
> --- a/contrib/postgres_fdw/postgres_fdw.c
> +++ b/contrib/postgres_fdw/postgres_fdw.c
> @@ -1667,6 +1667,11 @@ remove_async_node(ForeignScanState *node)
>
>          if (cur == node)
>          {
> +            PGconn *conn = curstate->s.conn;
> +
> +            while(PQisBusy(conn))
> +                PQclear(PQgetResult(conn));
> +
>              prev_state->waiter = curstate->waiter;
>
>              /* relink to the previous node if the last node was 
> removed */
>

Sorry, but it is not the only problem.
If you execute the query above and then in the same backend try to 
insert more records, then backend is crashed:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f5dfc59a231 in fetch_received_data (node=0x230c130) at 
postgres_fdw.c:3736
3736        Assert(fsstate->s.commonstate->leader == node);
(gdb) p sstate->s.commonstate
No symbol "sstate" in current context.
(gdb) p fsstate->s.commonstate
Cannot access memory at address 0x7f7f7f7f7f7f7f87

Also my patch doesn't solve the problem for small number of records 
(100) in the table.
I attach yet another patch which fix both problems.
Please notice that I did not go deep inside code of async append, so I 
am not sure that my patch is complete and correct.


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Tue, Sep 22, 2020 at 9:52 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> On 20.08.2020 10:36, Kyotaro Horiguchi wrote:
> > Come to think of "complex", ExecAsync stuff in this patch might be
> > too-much for a short-term solution until executor overhaul, if it
> > comes shortly. (the patch of mine here as a whole is like that,
> > though..). The queueing stuff in postgres_fdw is, too.

> Looks like current implementation of asynchronous append incorrectly
> handle LIMIT clause:
>
> psql:append.sql:10: ERROR:  another command is already in progress
> CONTEXT:  remote SQL command: CLOSE c1

Thanks for the report (and patch)!

The same issue has already been noticed in [1].  I too think the cause
of the issue would be in the 0003 patch (ie, “the queueing stuff “ in
postgres_fdw), but I’m not sure it is really a good idea to have that
in postgres_fdw in the first place, because it would impact
performance negatively in some cases (see [1]).

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx%2B3dyivCT%3DA%40mail.gmail.com



Re: Asynchronous Append on postgres_fdw nodes.

From
Andrey Lepikhov
Date:
Your AsyncAppend doesn't switch to another source if the data in current 
leader is available:

/*
  * The request for the next node cannot be sent before the leader
  * responds. Finish the current leader if possible.
  */
if (PQisBusy(leader_state->s.conn))
{
   int rc = WaitLatchOrSocket(NULL, WL_SOCKET_READABLE | WL_TIMEOUT | 
WL_EXIT_ON_PM_DEATH, PQsocket(leader_state->s.conn), 0, 
WAIT_EVENT_ASYNC_WAIT);
   if (!(rc & WL_SOCKET_READABLE))
     available = false;
}

/* fetch the leader's data and enqueue it for the next request */
if (available)
{
   fetch_received_data(leader);
   add_async_waiter(leader);
}

I don't understand, why it is needed. If we have fdw connections with 
different latency, then we will read data from the fast connection 
first. I think this may be a source of skew and decrease efficiency of 
asynchronous append.

For example, see below synthetic query:
CREATE TABLE l (a integer) PARTITION BY LIST (a);
CREATE FOREIGN TABLE f1 PARTITION OF l FOR VALUES IN (1) SERVER lb 
OPTIONS (table_name 'l1');
CREATE FOREIGN TABLE f2 PARTITION OF l FOR VALUES IN (2) SERVER lb 
OPTIONS (table_name 'l2');

INSERT INTO l (a) SELECT 2 FROM generate_series(1,200) as gs;
INSERT INTO l (a) SELECT 1 FROM generate_series(1,1000) as gs;

EXPLAIN ANALYZE (SELECT * FROM f1) UNION ALL (SELECT * FROM f2) LIMIT 400;

Result:
Limit  (cost=100.00..122.21 rows=400 width=4) (actual time=0.483..1.183 
rows=400 loops=1)
    ->  Append  (cost=100.00..424.75 rows=5850 width=4) (actual 
time=0.482..1.149 rows=400 loops=1)
          ->  Foreign Scan on f1  (cost=100.00..197.75 rows=2925 
width=4) (actual time=0.481..1.115 rows=400 loops=1)
          ->  Foreign Scan on f2  (cost=100.00..197.75 rows=2925 
width=4) (never executed)

As you can see, executor scans one input and doesn't tried to scan another.

-- 
regards,
Andrey Lepikhov
Postgres Professional



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Thu, Aug 20, 2020 at 4:36 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> This is rebased version.

Thanks for the rebased version!

> Come to think of "complex", ExecAsync stuff in this patch might be
> too-much for a short-term solution until executor overhaul, if it
> comes shortly. (the patch of mine here as a whole is like that,
> though..). The queueing stuff in postgres_fdw is, too.

Here are some review comments on “ExecAsync stuff” (the 0002 patch):

@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)

    i = -1;
    while ((i = bms_next_member(validsubplans, i)) >= 0)
    {
        Plan       *initNode = (Plan *) list_nth(node->appendplans, i);
+       int         sub_eflags = eflags;
+
+       /* Let async-capable subplans run asynchronously */
+       if (i < node->nasyncplans)
+       {
+           sub_eflags |= EXEC_FLAG_ASYNC;
+           nasyncplans++;
+       }

This would be more ambitious than Thomas’ patch: his patch only allows
foreign scan nodes beneath an Append node to be executed
asynchronously, but your patch allows any plan nodes beneath it (e.g.,
local child joins between foreign tables).  Right?  I think that would
be great, but I’m not sure how we execute such plan nodes
asynchronously as other parts of your patch seem to assume that only
foreign scan nodes beneath an Append are considered as async-capable.
Maybe I’m missing something, though.  Could you elaborate on that?

Your patch (and the original patch by Robert [1]) modified
ExecAppend() so that it can process local plan nodes while waiting for
the results from remote queries, which would be also a feature that’s
not supported by Thomas’ patch, but I’d like to know performance
results.  Did you do performance testing on that?  I couldn’t find
that from the archive.

+bool
+is_async_capable_path(Path *path)
+{
+   switch (nodeTag(path))
+   {
+       case T_ForeignPath:
+           {
+               FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+               Assert(fdwroutine != NULL);
+               if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+                   fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+                   return true;
+           }

Do we really need to introduce the FDW API
IsForeignPathAsyncCapable()?  I think we could determine whether a
foreign path is async-capable, by checking whether the FDW has the
postgresForeignAsyncConfigureWait() API.

In relation to the first comment, I noticed this change in the
postgres_fdw regression tests:

HEAD:

EXPLAIN (VERBOSE, COSTS OFF)
SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
                               QUERY PLAN
------------------------------------------------------------------------
 Sort
   Output: t1.a, (count(((t1.*)::pagg_tab)))
   Sort Key: t1.a
   ->  Append
         ->  HashAggregate
               Output: t1.a, count(((t1.*)::pagg_tab))
               Group Key: t1.a
               Filter: (avg(t1.b) < '22'::numeric)
               ->  Foreign Scan on public.fpagg_tab_p1 t1
                     Output: t1.a, t1.*, t1.b
                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
         ->  HashAggregate
               Output: t1_1.a, count(((t1_1.*)::pagg_tab))
               Group Key: t1_1.a
               Filter: (avg(t1_1.b) < '22'::numeric)
               ->  Foreign Scan on public.fpagg_tab_p2 t1_1
                     Output: t1_1.a, t1_1.*, t1_1.b
                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
         ->  HashAggregate
               Output: t1_2.a, count(((t1_2.*)::pagg_tab))
               Group Key: t1_2.a
               Filter: (avg(t1_2.b) < '22'::numeric)
               ->  Foreign Scan on public.fpagg_tab_p3 t1_2
                     Output: t1_2.a, t1_2.*, t1_2.b
                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
(25 rows)

Patched:

EXPLAIN (VERBOSE, COSTS OFF)
SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
                               QUERY PLAN
------------------------------------------------------------------------
 Sort
   Output: t1.a, (count(((t1.*)::pagg_tab)))
   Sort Key: t1.a
   ->  HashAggregate
         Output: t1.a, count(((t1.*)::pagg_tab))
         Group Key: t1.a
         Filter: (avg(t1.b) < '22'::numeric)
         ->  Append
               Async subplans: 3
               ->  Async Foreign Scan on public.fpagg_tab_p1 t1_1
                     Output: t1_1.a, t1_1.*, t1_1.b
                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
               ->  Async Foreign Scan on public.fpagg_tab_p2 t1_2
                     Output: t1_2.a, t1_2.*, t1_2.b
                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
               ->  Async Foreign Scan on public.fpagg_tab_p3 t1_3
                     Output: t1_3.a, t1_3.*, t1_3.b
                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
(18 rows)

So, your patch can only handle foreign scan nodes beneath an Append
for now?  Anyway, I think this would lead to the improved efficiency,
considering performance results from Movead [2].  And I think planner
changes to make this happen would be a good thing in your patch.

That’s all I have for now.  Sorry for the delay.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CA%2BTgmoaXQEt4tZ03FtQhnzeDEMzBck%2BLrni0UWHVVgOTnA6C1w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/2020011417113872105895%40highgo.ca



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
Thanks for reviewing.

At Sat, 26 Sep 2020 19:45:39 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in 
> > Come to think of "complex", ExecAsync stuff in this patch might be
> > too-much for a short-term solution until executor overhaul, if it
> > comes shortly. (the patch of mine here as a whole is like that,
> > though..). The queueing stuff in postgres_fdw is, too.
> 
> Here are some review comments on “ExecAsync stuff” (the 0002 patch):
> 
> @@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
> 
>     i = -1;
>     while ((i = bms_next_member(validsubplans, i)) >= 0)
>     {
>         Plan       *initNode = (Plan *) list_nth(node->appendplans, i);
> +       int         sub_eflags = eflags;
> +
> +       /* Let async-capable subplans run asynchronously */
> +       if (i < node->nasyncplans)
> +       {
> +           sub_eflags |= EXEC_FLAG_ASYNC;
> +           nasyncplans++;
> +       }
> 
> This would be more ambitious than Thomas’ patch: his patch only allows
> foreign scan nodes beneath an Append node to be executed
> asynchronously, but your patch allows any plan nodes beneath it (e.g.,
> local child joins between foreign tables).  Right?  I think that would

Right. It is intended to work any place, but all upper nodes up to the
common node must be "async-aware and capable" for the machinery to work. So it
doesn't work currently since Append is the only async-aware node.
> be great, but I’m not sure how we execute such plan nodes
> asynchronously as other parts of your patch seem to assume that only
> foreign scan nodes beneath an Append are considered as async-capable.
> Maybe I’m missing something, though.  Could you elaborate on that?

Right about this patch.  As a trial at hand, in my faint memory, some
join methods and some aggregaioion can be async-aware but they are not
included in this patch not to bloat it with more complex stuff.

> Your patch (and the original patch by Robert [1]) modified
> ExecAppend() so that it can process local plan nodes while waiting for
> the results from remote queries, which would be also a feature that’s
> not supported by Thomas’ patch, but I’d like to know performance
> results.  Did you do performance testing on that?  I couldn’t find
> that from the archive.

At least, even though theoretically, I think it's obvious that it's
performant to do something than just sitting waitng for the next tuple
to come from abroad.  (I's not so obvious for slow local
vs. hyperspeed-remotes configuration, but...)

> +bool
> +is_async_capable_path(Path *path)
> +{
> +   switch (nodeTag(path))
> +   {
> +       case T_ForeignPath:
> +           {
> +               FdwRoutine *fdwroutine = path->parent->fdwroutine;
> +
> +               Assert(fdwroutine != NULL);
> +               if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
> +                   fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
> +                   return true;
> +           }
> 
> Do we really need to introduce the FDW API
> IsForeignPathAsyncCapable()?  I think we could determine whether a
> foreign path is async-capable, by checking whether the FDW has the
> postgresForeignAsyncConfigureWait() API.

Note that the API routine takes a path, but it's just that a child
path in a certain form theoretically can obstruct async behavior.

> In relation to the first comment, I noticed this change in the
> postgres_fdw regression tests:
> 
> HEAD:
> 
> EXPLAIN (VERBOSE, COSTS OFF)
> SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Sort
>    Output: t1.a, (count(((t1.*)::pagg_tab)))
>    Sort Key: t1.a
>    ->  Append
>          ->  HashAggregate
>                Output: t1.a, count(((t1.*)::pagg_tab))
>                Group Key: t1.a
>                Filter: (avg(t1.b) < '22'::numeric)
>                ->  Foreign Scan on public.fpagg_tab_p1 t1
>                      Output: t1.a, t1.*, t1.b
>                      Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
>          ->  HashAggregate
>                Output: t1_1.a, count(((t1_1.*)::pagg_tab))
>                Group Key: t1_1.a
>                Filter: (avg(t1_1.b) < '22'::numeric)
>                ->  Foreign Scan on public.fpagg_tab_p2 t1_1
>                      Output: t1_1.a, t1_1.*, t1_1.b
>                      Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
>          ->  HashAggregate
>                Output: t1_2.a, count(((t1_2.*)::pagg_tab))
>                Group Key: t1_2.a
>                Filter: (avg(t1_2.b) < '22'::numeric)
>                ->  Foreign Scan on public.fpagg_tab_p3 t1_2
>                      Output: t1_2.a, t1_2.*, t1_2.b
>                      Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
> (25 rows)
> 
> Patched:
> 
> EXPLAIN (VERBOSE, COSTS OFF)
> SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Sort
>    Output: t1.a, (count(((t1.*)::pagg_tab)))
>    Sort Key: t1.a
>    ->  HashAggregate
>          Output: t1.a, count(((t1.*)::pagg_tab))
>          Group Key: t1.a
>          Filter: (avg(t1.b) < '22'::numeric)
>          ->  Append
>                Async subplans: 3
>                ->  Async Foreign Scan on public.fpagg_tab_p1 t1_1
>                      Output: t1_1.a, t1_1.*, t1_1.b
>                      Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
>                ->  Async Foreign Scan on public.fpagg_tab_p2 t1_2
>                      Output: t1_2.a, t1_2.*, t1_2.b
>                      Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
>                ->  Async Foreign Scan on public.fpagg_tab_p3 t1_3
>                      Output: t1_3.a, t1_3.*, t1_3.b
>                      Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
> (18 rows)
> 
> So, your patch can only handle foreign scan nodes beneath an Append

Yes, as I wrote above. Append-Foreign is the most promising and
suitable as an example.  (and... Agg/WindowAgg are the hardest nodes
to make async-aware.)

> for now?  Anyway, I think this would lead to the improved efficiency,
> considering performance results from Movead [2].  And I think planner
> changes to make this happen would be a good thing in your patch.

Thanks.

> That’s all I have for now.  Sorry for the delay.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Mon, Sep 28, 2020 at 10:35 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Sat, 26 Sep 2020 19:45:39 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
> > Here are some review comments on “ExecAsync stuff” (the 0002 patch):
> >
> > @@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
> >
> >     i = -1;
> >     while ((i = bms_next_member(validsubplans, i)) >= 0)
> >     {
> >         Plan       *initNode = (Plan *) list_nth(node->appendplans, i);
> > +       int         sub_eflags = eflags;
> > +
> > +       /* Let async-capable subplans run asynchronously */
> > +       if (i < node->nasyncplans)
> > +       {
> > +           sub_eflags |= EXEC_FLAG_ASYNC;
> > +           nasyncplans++;
> > +       }
> >
> > This would be more ambitious than Thomas’ patch: his patch only allows
> > foreign scan nodes beneath an Append node to be executed
> > asynchronously, but your patch allows any plan nodes beneath it (e.g.,
> > local child joins between foreign tables).  Right?  I think that would
>
> Right. It is intended to work any place,

> > be great, but I’m not sure how we execute such plan nodes
> > asynchronously as other parts of your patch seem to assume that only
> > foreign scan nodes beneath an Append are considered as async-capable.
> > Maybe I’m missing something, though.  Could you elaborate on that?
>
> Right about this patch.  As a trial at hand, in my faint memory, some
> join methods and some aggregaioion can be async-aware but they are not
> included in this patch not to bloat it with more complex stuff.

Yeah.  I’m concerned about what was discussed in [1] as well.  I think
it would be better only to allow foreign scan nodes beneath an Append,
as in Thomas’ patch (and the original patch by Robert), at least in
the first cut of this feature.

BTW: I noticed that you changed the ExecProcNode() API so that an
Append calling FDWs can know wether they return tuples immediately or
not:

+   while ((i = bms_first_member(needrequest)) >= 0)
+   {
+       TupleTableSlot *slot;
+       PlanState *subnode = node->appendplans[i];
+
+       slot = ExecProcNode(subnode);
+       if (subnode->asyncstate == AS_AVAILABLE)
+       {
+           if (!TupIsNull(slot))
+           {
+               node->as_asyncresult[node->as_nasyncresult++] = slot;
+               node->as_needrequest = bms_add_member(node->as_needrequest, i);
+           }
+       }
+       else
+           node->as_pending_async = bms_add_member(node->as_pending_async, i);
+   }

In the case of postgres_fdw:

 /*
  * postgresIterateForeignScan
- *     Retrieve next row from the result set, or clear tuple slot to indicate
- *     EOF.
+ *     Retrieve next row from the result set.
+ *
+ *     For synchronous nodes, returns clear tuple slot means EOF.
+ *
+ *     For asynchronous nodes, if clear tuple slot is returned, the caller
+ *     needs to check async state to tell if all tuples received
+ *     (AS_AVAILABLE) or waiting for the next data to come (AS_WAITING).
  */

That is, 1) in postgresIterateForeignScan() postgres_fdw sets the new
PlanState’s flag asyncstate to AS_AVAILABLE/AS_WAITING depending on
whether it returns a tuple immediately or not, and then 2) the Append
knows that from the new flag when the callback routine returns.  I’m
not sure this is a good idea, because it seems likely that the
ExecProcNode() change would affect many other places in the executor,
making maintenance and/or future development difficult.  I think the
FDW callback routines proposed in the original patch by Robert would
provide a cleaner way to do asynchronous execution of FDWs without
changing the ExecProcNode() API, IIUC:

+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.

What is the reason for not doing like this in your patch?

Thanks for the explanation!

Best regards,
Etsuro Fujita

[1]
https://www.postgresql.org/message-id/CA%2BTgmoYrbgTBnLwnr1v%3Dpk%2BC%3DznWg7AgV9%3DM9ehrq6TDexPQNw%40mail.gmail.com



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Mon, Sep 28, 2020 at 10:35 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Sat, 26 Sep 2020 19:45:39 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
> > Your patch (and the original patch by Robert [1]) modified
> > ExecAppend() so that it can process local plan nodes while waiting for
> > the results from remote queries, which would be also a feature that’s
> > not supported by Thomas’ patch, but I’d like to know performance
> > results.

> At least, even though theoretically, I think it's obvious that it's
> performant to do something than just sitting waitng for the next tuple
> to come from abroad.

I did a simple test on my laptop:

create table t1 (a int, b int, c text);
create foreign table p1 (a int, b int, c text) server server1 options
(table_name 't1');
create table p2 (a int, b int, c text);

insert into p1 select 10 + i % 10, i, to_char(i, 'FM00000') from
generate_series(0, 99999) i;
insert into p2 select 20 + i % 10, i, to_char(i, 'FM00000') from
generate_series(0, 99999) i;

analyze p1;
vacuum analyze p2;

create table pt (a int, b int, c text) partition by range (a);
alter table pt attach partition p1 for values from (10) to (20);
alter table pt attach partition p2 for values from (20) to (30);

set enable_partitionwise_aggregate to on;

select a, count(*) from pt group by a;

HEAD: 47.734 ms
With your patch: 32.400 ms

This test is pretty simple, but I think this shows that the mentioned
feature would be useful for cases where it takes time to get the
results from remote queries.

Cool!

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
At Wed, 30 Sep 2020 16:30:41 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in 
> On Mon, Sep 28, 2020 at 10:35 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > At Sat, 26 Sep 2020 19:45:39 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
> > > Your patch (and the original patch by Robert [1]) modified
> > > ExecAppend() so that it can process local plan nodes while waiting for
> > > the results from remote queries, which would be also a feature that’s
> > > not supported by Thomas’ patch, but I’d like to know performance
> > > results.
> 
> > At least, even though theoretically, I think it's obvious that it's
> > performant to do something than just sitting waitng for the next tuple
> > to come from abroad.
> 
> I did a simple test on my laptop:
> 
> create table t1 (a int, b int, c text);
> create foreign table p1 (a int, b int, c text) server server1 options
> (table_name 't1');
> create table p2 (a int, b int, c text);
> 
> insert into p1 select 10 + i % 10, i, to_char(i, 'FM00000') from
> generate_series(0, 99999) i;
> insert into p2 select 20 + i % 10, i, to_char(i, 'FM00000') from
> generate_series(0, 99999) i;
> 
> analyze p1;
> vacuum analyze p2;
> 
> create table pt (a int, b int, c text) partition by range (a);
> alter table pt attach partition p1 for values from (10) to (20);
> alter table pt attach partition p2 for values from (20) to (30);
> 
> set enable_partitionwise_aggregate to on;
> 
> select a, count(*) from pt group by a;
> 
> HEAD: 47.734 ms
> With your patch: 32.400 ms
> 
> This test is pretty simple, but I think this shows that the mentioned
> feature would be useful for cases where it takes time to get the
> results from remote queries.
> 
> Cool!

Thanks.  Since it starts all remote nodes before local ones, the
startup gain would be the shorter of the startup time of the fastest
remote and the time required for all local nodes.  Plus remote
transfer gets asynchronous fetch gain.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
At Thu, 1 Oct 2020 12:56:02 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Thu, Oct 01, 2020 at 11:16:53AM +0900, Kyotaro Horiguchi wrote:
> > Thanks.  Since it starts all remote nodes before local ones, the
> > startup gain would be the shorter of the startup time of the fastest
> > remote and the time required for all local nodes.  Plus remote
> > transfer gets asynchronous fetch gain.
> 
> The patch fails to apply per the CF bot.  For now, I have moved it to
> next CF, waiting on author.

Thanks! Rebased.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 09a38c30aed31673d3f9360a1853f5f99948f016 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH v7 1/3] Allow wait event set to be registered to resource
 owner

WaitEventSet needs to be released using resource owner for a certain
case. This change adds WaitEventSet reowner and allow the creator of a
WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c            |  2 +-
 src/backend/postmaster/pgstat.c       |  2 +-
 src/backend/postmaster/syslogger.c    |  2 +-
 src/backend/storage/ipc/latch.c       | 20 ++++++--
 src/backend/utils/resowner/resowner.c | 67 +++++++++++++++++++++++++++
 src/include/storage/latch.h           |  4 +-
 src/include/utils/resowner_private.h  |  8 ++++
 7 files changed, 98 insertions(+), 7 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index ac986c0505..799fa5006d 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -218,7 +218,7 @@ pq_init(void)
                 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+    FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
     AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
                       NULL, NULL);
     AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e6be2b7836..30020f8cda 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4503,7 +4503,7 @@ PgstatCollectorMain(int argc, char *argv[])
     pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
 
     /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
+    wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
     AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
     AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
     AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index ffcb54968f..a4de6d90e2 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -300,7 +300,7 @@ SysLoggerMain(int argc, char *argv[])
      * syslog pipe, which implies that all other backends have exited
      * (including the postmaster).
      */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 2);
+    wes = CreateWaitEventSet(CurrentMemoryContext, NULL, 2);
     AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
 #ifndef WIN32
     AddWaitEventToSet(wes, WL_SOCKET_READABLE, syslogPipe[0], NULL, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 63c6c97536..108a6127e9 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -57,6 +57,7 @@
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
 #include "utils/memutils.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -85,6 +86,8 @@ struct WaitEventSet
     int            nevents;        /* number of registered events */
     int            nevents_space;    /* maximum number of events in this set */
 
+    ResourceOwner    resowner;    /* Resource owner */
+
     /*
      * Array, of nevents_space length, storing the definition of events this
      * set is waiting for.
@@ -257,7 +260,7 @@ InitializeLatchWaitSet(void)
     Assert(LatchWaitSet == NULL);
 
     /* Set up the WaitEventSet used by WaitLatch(). */
-    LatchWaitSet = CreateWaitEventSet(TopMemoryContext, 2);
+    LatchWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 2);
     latch_pos = AddWaitEventToSet(LatchWaitSet, WL_LATCH_SET, PGINVALID_SOCKET,
                                   MyLatch, NULL);
     if (IsUnderPostmaster)
@@ -441,7 +444,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
     int            ret = 0;
     int            rc;
     WaitEvent    event;
-    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
     if (wakeEvents & WL_TIMEOUT)
         Assert(timeout >= 0);
@@ -608,12 +611,15 @@ ResetLatch(Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
     WaitEventSet *set;
     char       *data;
     Size        sz = 0;
 
+    if (res)
+        ResourceOwnerEnlargeWESs(res);
+
     /*
      * Use MAXALIGN size/alignment to guarantee that later uses of memory are
      * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -728,6 +734,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
     StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+    /* Register this wait event set if requested */
+    set->resowner = res;
+    if (res)
+        ResourceOwnerRememberWES(set->resowner, set);
+
     return set;
 }
 
@@ -773,6 +784,9 @@ FreeWaitEventSet(WaitEventSet *set)
     }
 #endif
 
+    if (set->resowner != NULL)
+        ResourceOwnerForgetWES(set->resowner, set);
+
     pfree(set);
 }
 
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 8bc2c4e9ea..237ca9fa30 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -128,6 +128,7 @@ typedef struct ResourceOwnerData
     ResourceArray filearr;        /* open temporary files */
     ResourceArray dsmarr;        /* dynamic shmem segments */
     ResourceArray jitarr;        /* JIT contexts */
+    ResourceArray wesarr;        /* wait event sets */
 
     /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
     int            nlocks;            /* number of owned locks */
@@ -175,6 +176,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -444,6 +446,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
     ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
     ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
     ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+    ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
     return owner;
 }
@@ -553,6 +556,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 
             jit_release_context(context);
         }
+
+        /* Ditto for wait event sets */
+        while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+        {
+            WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+            if (isCommit)
+                PrintWESLeakWarning(event);
+            FreeWaitEventSet(event);
+        }
     }
     else if (phase == RESOURCE_RELEASE_LOCKS)
     {
@@ -725,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     Assert(owner->filearr.nitems == 0);
     Assert(owner->dsmarr.nitems == 0);
     Assert(owner->jitarr.nitems == 0);
+    Assert(owner->wesarr.nitems == 0);
     Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
     /*
@@ -752,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
     ResourceArrayFree(&(owner->filearr));
     ResourceArrayFree(&(owner->dsmarr));
     ResourceArrayFree(&(owner->jitarr));
+    ResourceArrayFree(&(owner->wesarr));
 
     pfree(owner);
 }
@@ -1370,3 +1385,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
         elog(ERROR, "JIT context %p is not owned by resource owner %s",
              DatumGetPointer(handle), owner->name);
 }
+
+/*
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+    ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+    ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identier of a wait event set,
+     * use its pointer instead.
+     */
+    if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+        elog(ERROR, "wait event set %p is not owned by resource owner %s",
+             events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+    /*
+     * XXXX: There's no property to show as an identier of a wait event set,
+     * use its pointer instead.
+     */
+    elog(WARNING, "wait event set leak: %p still referenced",
+         events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 7c742021fb..ae13d4c08d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -163,7 +164,8 @@ extern void DisownLatch(Latch *latch);
 extern void SetLatch(Latch *latch);
 extern void ResetLatch(Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+                                        ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int    AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
                               Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..7d19dadd57 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
 extern void ResourceOwnerForgetJIT(ResourceOwner owner,
                                    Datum handle);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+                         WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+                       WaitEventSet *);
+
 #endif                            /* RESOWNER_PRIVATE_H */
-- 
2.18.4

From c47ea326f3557d7ac03886c91fda5ebf689ae068 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 15 May 2018 20:21:32 +0900
Subject: [PATCH v7 2/3] Infrastructure for asynchronous execution

This patch add an infrastructure for asynchronous execution. As a PoC
this makes only Append capable to handle asynchronously executable
subnodes.
---
 src/backend/commands/explain.c          |  17 ++
 src/backend/executor/Makefile           |   1 +
 src/backend/executor/execAsync.c        | 152 +++++++++++
 src/backend/executor/nodeAppend.c       | 342 ++++++++++++++++++++----
 src/backend/executor/nodeForeignscan.c  |  21 ++
 src/backend/nodes/bitmapset.c           |  72 +++++
 src/backend/nodes/copyfuncs.c           |   3 +
 src/backend/nodes/outfuncs.c            |   3 +
 src/backend/nodes/readfuncs.c           |   3 +
 src/backend/optimizer/path/allpaths.c   |  24 ++
 src/backend/optimizer/path/costsize.c   |  55 +++-
 src/backend/optimizer/plan/createplan.c |  45 +++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/utils/adt/ruleutils.c       |   8 +-
 src/backend/utils/resowner/resowner.c   |   4 +-
 src/include/executor/execAsync.h        |  22 ++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 +
 src/include/nodes/bitmapset.h           |   1 +
 src/include/nodes/execnodes.h           |  23 +-
 src/include/nodes/plannodes.h           |   9 +
 src/include/optimizer/paths.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 24 files changed, 756 insertions(+), 72 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c98c9b5547..097355f6f9 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -86,6 +86,7 @@ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
                                        List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                                    ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
                           ExplainState *es);
 static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1377,6 +1378,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
         }
         if (plan->parallel_aware)
             appendStringInfoString(es->str, "Parallel ");
+        if (plan->async_capable)
+            appendStringInfoString(es->str, "Async ");
         appendStringInfoString(es->str, pname);
         es->indent++;
     }
@@ -1958,6 +1961,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
         case T_Hash:
             show_hash_info(castNode(HashState, planstate), es);
             break;
+
+        case T_Append:
+            show_append_info(castNode(AppendState, planstate), es);
+            break;
+
         default:
             break;
     }
@@ -2311,6 +2319,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                          ancestors, es);
 }
 
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+    Append *plan = (Append *) astate->ps.plan;
+
+    if (plan->nasyncplans > 0)
+        ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
 /*
  * Show the grouping keys for an Agg node.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f990c6473a..1004647d4f 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
     execAmi.o \
+    execAsync.o \
     execCurrent.o \
     execExpr.o \
     execExprInterp.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..2b7d1877e0
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,152 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *      Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *      src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+/*
+ * ExecAsyncConfigureWait: Add wait event to the WaitEventSet if needed.
+ *
+ * If reinit is true, the caller didn't reuse existing WaitEventSet.
+ */
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                       void *data, bool reinit)
+{
+    switch (nodeTag(node))
+    {
+    case T_ForeignScanState:
+        return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+                                             wes, data, reinit);
+        break;
+    default:
+            elog(ERROR, "unrecognized node type: %d",
+                (int) nodeTag(node));
+    }
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+    int **p_refind;
+    int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+    /* arg is the address of the variable refind in ExecAsyncEventWait */
+    ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+    *mcbarg->p_refind = NULL;
+    *mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * ExecAsyncEventWait:
+ *
+ * Wait for async events to fire. Returns the Bitmapset of fired events.
+ */
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+    static int *refind = NULL;
+    static int refindsize = 0;
+    WaitEventSet *wes;
+    WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+    int noccurred = 0;
+    Bitmapset *fired_events = NULL;
+    int i;
+    int n;
+
+    n = bms_num_members(waitnodes);
+    wes = CreateWaitEventSet(TopTransactionContext,
+                             TopTransactionResourceOwner, n);
+    if (refindsize < n)
+    {
+        if (refindsize == 0)
+            refindsize = EVENT_BUFFER_SIZE; /* XXX */
+        while (refindsize < n)
+            refindsize *= 2;
+        if (refind)
+            refind = (int *) repalloc(refind, refindsize * sizeof(int));
+        else
+        {
+            static ExecAsync_mcbarg mcb_arg =
+                { &refind, &refindsize };
+            static MemoryContextCallback mcb =
+                { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+            MemoryContext oldctxt =
+                MemoryContextSwitchTo(TopTransactionContext);
+
+            /*
+             * refind points to a memory block in
+             * TopTransactionContext. Register a callback to reset it.
+             */
+            MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+            refind = (int *) palloc(refindsize * sizeof(int));
+            MemoryContextSwitchTo(oldctxt);
+        }
+    }
+
+    /* Prepare WaitEventSet for waiting on the waitnodes. */
+    n = 0;
+    for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+         i = bms_next_member(waitnodes, i))
+    {
+        refind[i] = i;
+        if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+            n++;
+    }
+
+    /* Return immediately if no node to wait. */
+    if (n == 0)
+    {
+        FreeWaitEventSet(wes);
+        return NULL;
+    }
+
+    noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+                                 EVENT_BUFFER_SIZE,
+                                 WAIT_EVENT_ASYNC_WAIT);
+    FreeWaitEventSet(wes);
+    if (noccurred == 0)
+        return NULL;
+
+    for (i = 0 ; i < noccurred ; i++)
+    {
+        WaitEvent *w = &occurred_event[i];
+
+        if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+        {
+            int n = *(int*)w->user_data;
+
+            fired_events = bms_add_member(fired_events, n);
+        }
+    }
+
+    return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 88919e62fa..60c36ee048 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
 #include "executor/execdebug.h"
 #include "executor/execPartition.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -80,6 +81,7 @@ struct ParallelAppendState
 #define INVALID_SUBPLAN_INDEX        -1
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -103,22 +105,22 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     PlanState **appendplanstates;
     Bitmapset  *validsubplans;
     int            nplans;
+    int            nasyncplans;
     int            firstvalid;
     int            i,
                 j;
 
     /* check for unsupported flags */
-    Assert(!(eflags & EXEC_FLAG_MARK));
+    Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
     /*
      * create new AppendState for our append node
      */
     appendstate->ps.plan = (Plan *) node;
     appendstate->ps.state = estate;
-    appendstate->ps.ExecProcNode = ExecAppend;
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+    appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 
     /* If run-time partition pruning is enabled, then set that up now */
     if (node->part_prune_info != NULL)
@@ -152,11 +154,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 
         /*
          * When no run-time pruning is required and there's at least one
-         * subplan, we can fill as_valid_subplans immediately, preventing
+         * subplan, we can fill as_valid_syncsubplans immediately, preventing
          * later calls to ExecFindMatchingSubPlans.
          */
         if (!prunestate->do_exec_prune && nplans > 0)
-            appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
+            appendstate->as_valid_syncsubplans =
+                bms_add_range(NULL, node->nasyncplans, nplans - 1);
     }
     else
     {
@@ -167,8 +170,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
          * subplans as valid; they must also all be initialized.
          */
         Assert(nplans > 0);
-        appendstate->as_valid_subplans = validsubplans =
-            bms_add_range(NULL, 0, nplans - 1);
+        validsubplans =    bms_add_range(NULL, 0, nplans - 1);
+        appendstate->as_valid_syncsubplans =
+            bms_add_range(NULL, node->nasyncplans, nplans - 1);
         appendstate->as_prune_state = NULL;
     }
 
@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
      */
     j = 0;
     firstvalid = nplans;
+    nasyncplans = 0;
+
     i = -1;
     while ((i = bms_next_member(validsubplans, i)) >= 0)
     {
         Plan       *initNode = (Plan *) list_nth(node->appendplans, i);
+        int            sub_eflags = eflags;
+
+        /* Let async-capable subplans run asynchronously */
+        if (i < node->nasyncplans)
+        {
+            sub_eflags |= EXEC_FLAG_ASYNC;
+            nasyncplans++;
+        }
 
         /*
          * Record the lowest appendplans index which is a valid partial plan.
@@ -203,13 +217,46 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
         if (i >= node->first_partial_plan && j < firstvalid)
             firstvalid = j;
 
-        appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+        appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
     }
 
     appendstate->as_first_partial_plan = firstvalid;
     appendstate->appendplans = appendplanstates;
     appendstate->as_nplans = nplans;
 
+    /* fill in async stuff */
+    appendstate->as_nasyncplans = nasyncplans;
+    appendstate->as_syncdone = (nasyncplans == nplans);
+    appendstate->as_exec_prune = false;
+
+    /* choose appropriate version of Exec function */
+    if (appendstate->as_nasyncplans == 0)
+        appendstate->ps.ExecProcNode = ExecAppend;
+    else
+        appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+    if (appendstate->as_nasyncplans)
+    {
+        appendstate->as_asyncresult = (TupleTableSlot **)
+            palloc0(appendstate->as_nasyncplans * sizeof(TupleTableSlot *));
+
+        /* initially, all async requests need a request */
+        appendstate->as_needrequest =
+            bms_add_range(NULL, 0, appendstate->as_nasyncplans - 1);
+
+        /*
+         * ExecAppendAsync needs as_valid_syncsubplans to handle async
+         * subnodes.
+         */
+        if (appendstate->as_prune_state != NULL &&
+            appendstate->as_prune_state->do_exec_prune)
+        {
+            Assert(appendstate->as_valid_syncsubplans == NULL);
+
+            appendstate->as_exec_prune = true;
+        }
+    }
+
     /*
      * Miscellaneous initialization
      */
@@ -233,7 +280,7 @@ ExecAppend(PlanState *pstate)
 {
     AppendState *node = castNode(AppendState, pstate);
 
-    if (node->as_whichplan < 0)
+    if (node->as_whichsyncplan < 0)
     {
         /* Nothing to do if there are no subplans */
         if (node->as_nplans == 0)
@@ -243,11 +290,13 @@ ExecAppend(PlanState *pstate)
          * If no subplan has been chosen, we must choose one before
          * proceeding.
          */
-        if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+        if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
             !node->choose_next_subplan(node))
             return ExecClearTuple(node->ps.ps_ResultTupleSlot);
     }
 
+    Assert(node->as_nasyncplans == 0);
+
     for (;;)
     {
         PlanState  *subnode;
@@ -258,8 +307,9 @@ ExecAppend(PlanState *pstate)
         /*
          * figure out which subplan we are currently processing
          */
-        Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-        subnode = node->appendplans[node->as_whichplan];
+        Assert(node->as_whichsyncplan >= 0 &&
+               node->as_whichsyncplan < node->as_nplans);
+        subnode = node->appendplans[node->as_whichsyncplan];
 
         /*
          * get a tuple from the subplan
@@ -282,6 +332,172 @@ ExecAppend(PlanState *pstate)
     }
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+    AppendState *node = castNode(AppendState, pstate);
+    Bitmapset *needrequest;
+    int    i;
+
+    Assert(node->as_nasyncplans > 0);
+
+restart:
+    if (node->as_nasyncresult > 0)
+    {
+        --node->as_nasyncresult;
+        return node->as_asyncresult[node->as_nasyncresult];
+    }
+
+    if (node->as_exec_prune)
+    {
+        Bitmapset *valid_subplans =
+            ExecFindMatchingSubPlans(node->as_prune_state);
+
+        /* Distribute valid subplans into sync and async */
+        node->as_needrequest =
+            bms_intersect(node->as_needrequest, valid_subplans);
+        node->as_valid_syncsubplans =
+            bms_difference(valid_subplans, node->as_needrequest);
+
+        node->as_exec_prune = false;
+    }
+
+    needrequest = node->as_needrequest;
+    node->as_needrequest = NULL;
+    while ((i = bms_first_member(needrequest)) >= 0)
+    {
+        TupleTableSlot *slot;
+        PlanState *subnode = node->appendplans[i];
+
+        slot = ExecProcNode(subnode);
+        if (subnode->asyncstate == AS_AVAILABLE)
+        {
+            if (!TupIsNull(slot))
+            {
+                node->as_asyncresult[node->as_nasyncresult++] = slot;
+                node->as_needrequest = bms_add_member(node->as_needrequest, i);
+            }
+        }
+        else
+            node->as_pending_async = bms_add_member(node->as_pending_async, i);
+    }
+    bms_free(needrequest);
+
+    for (;;)
+    {
+        TupleTableSlot *result;
+
+        /* return now if a result is available */
+        if (node->as_nasyncresult > 0)
+        {
+            --node->as_nasyncresult;
+            return node->as_asyncresult[node->as_nasyncresult];
+        }
+
+        while (!bms_is_empty(node->as_pending_async))
+        {
+            /* Don't wait for async nodes if any sync node exists. */
+            long timeout = node->as_syncdone ? -1 : 0;
+            Bitmapset *fired;
+            int i;
+
+            fired = ExecAsyncEventWait(node->appendplans,
+                                       node->as_pending_async,
+                                       timeout);
+
+            if (bms_is_empty(fired) && node->as_syncdone)
+            {
+                /*
+                 * We come here when all the subnodes had fired before
+                 * waiting. Retry fetching from the nodes.
+                 */
+                node->as_needrequest = node->as_pending_async;
+                node->as_pending_async = NULL;
+                goto restart;
+            }
+
+            while ((i = bms_first_member(fired)) >= 0)
+            {
+                TupleTableSlot *slot;
+                PlanState *subnode = node->appendplans[i];
+                slot = ExecProcNode(subnode);
+
+                Assert(subnode->asyncstate == AS_AVAILABLE);
+
+                if (!TupIsNull(slot))
+                {
+                    node->as_asyncresult[node->as_nasyncresult++] = slot;
+                    node->as_needrequest =
+                        bms_add_member(node->as_needrequest, i);
+                }
+
+                node->as_pending_async =
+                    bms_del_member(node->as_pending_async, i);
+            }
+            bms_free(fired);
+
+            /* return now if a result is available */
+            if (node->as_nasyncresult > 0)
+            {
+                --node->as_nasyncresult;
+                return node->as_asyncresult[node->as_nasyncresult];
+            }
+
+            if (!node->as_syncdone)
+                break;
+        }
+
+        /*
+         * If there is no asynchronous activity still pending and the
+         * synchronous activity is also complete, we're totally done scanning
+         * this node.  Otherwise, we're done with the asynchronous stuff but
+         * must continue scanning the synchronous children.
+         */
+
+        if (!node->as_syncdone &&
+            node->as_whichsyncplan == INVALID_SUBPLAN_INDEX)
+            node->as_syncdone = !node->choose_next_subplan(node);
+
+        if (node->as_syncdone)
+        {
+            Assert(bms_is_empty(node->as_pending_async));
+            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+        }
+
+        /*
+         * get a tuple from the subplan
+         */
+        result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+        if (!TupIsNull(result))
+        {
+            /*
+             * If the subplan gave us something then return it as-is. We do
+             * NOT make use of the result slot that was set up in
+             * ExecInitAppend; there's no need for it.
+             */
+            return result;
+        }
+
+        /*
+         * Go on to the "next" subplan. If no more subplans, return the empty
+         * slot set up for us by ExecInitAppend, unless there are async plans
+         * we have yet to finish.
+         */
+        if (!node->choose_next_subplan(node))
+        {
+            node->as_syncdone = true;
+            if (bms_is_empty(node->as_pending_async))
+            {
+                Assert(bms_is_empty(node->as_needrequest));
+                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+            }
+        }
+
+        /* Else loop back and try to get a tuple from the new subplan */
+    }
+}
+
 /* ----------------------------------------------------------------
  *        ExecEndAppend
  *
@@ -324,10 +540,18 @@ ExecReScanAppend(AppendState *node)
         bms_overlap(node->ps.chgParam,
                     node->as_prune_state->execparamids))
     {
-        bms_free(node->as_valid_subplans);
-        node->as_valid_subplans = NULL;
+        bms_free(node->as_valid_syncsubplans);
+        node->as_valid_syncsubplans = NULL;
     }
 
+    /* Reset async state. */
+    for (i = 0; i < node->as_nasyncplans; ++i)
+        ExecShutdownNode(node->appendplans[i]);
+
+    node->as_nasyncresult = 0;
+    node->as_needrequest = bms_add_range(NULL, 0, node->as_nasyncplans - 1);
+    node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
     for (i = 0; i < node->as_nplans; i++)
     {
         PlanState  *subnode = node->appendplans[i];
@@ -348,7 +572,7 @@ ExecReScanAppend(AppendState *node)
     }
 
     /* Let choose_next_subplan_* function handle setting the first subplan */
-    node->as_whichplan = INVALID_SUBPLAN_INDEX;
+    node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 }
 
 /* ----------------------------------------------------------------
@@ -436,7 +660,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-    int            whichplan = node->as_whichplan;
+    int            whichplan = node->as_whichsyncplan;
     int            nextplan;
 
     /* We should never be called when there are no subplans */
@@ -451,10 +675,18 @@ choose_next_subplan_locally(AppendState *node)
      */
     if (whichplan == INVALID_SUBPLAN_INDEX)
     {
-        if (node->as_valid_subplans == NULL)
-            node->as_valid_subplans =
+        /* Shouldn't have an active async node */
+        Assert(bms_is_empty(node->as_needrequest));
+
+        if (node->as_valid_syncsubplans == NULL)
+            node->as_valid_syncsubplans =
                 ExecFindMatchingSubPlans(node->as_prune_state);
 
+        /* Exclude async plans */
+        if (node->as_nasyncplans > 0)
+            bms_del_range(node->as_valid_syncsubplans,
+                          0, node->as_nasyncplans - 1);
+
         whichplan = -1;
     }
 
@@ -462,14 +694,14 @@ choose_next_subplan_locally(AppendState *node)
     Assert(whichplan >= -1 && whichplan <= node->as_nplans);
 
     if (ScanDirectionIsForward(node->ps.state->es_direction))
-        nextplan = bms_next_member(node->as_valid_subplans, whichplan);
+        nextplan = bms_next_member(node->as_valid_syncsubplans, whichplan);
     else
-        nextplan = bms_prev_member(node->as_valid_subplans, whichplan);
+        nextplan = bms_prev_member(node->as_valid_syncsubplans, whichplan);
 
     if (nextplan < 0)
         return false;
 
-    node->as_whichplan = nextplan;
+    node->as_whichsyncplan = nextplan;
 
     return true;
 }
@@ -490,29 +722,29 @@ choose_next_subplan_for_leader(AppendState *node)
     /* Backward scan is not supported by parallel-aware plans */
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
-    /* We should never be called when there are no subplans */
-    Assert(node->as_nplans > 0);
+    /* We should never be called when there are no sync subplans */
+    Assert(node->as_nplans > node->as_nasyncplans);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
     {
         /* Mark just-completed subplan as finished. */
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
     }
     else
     {
         /* Start with last subplan. */
-        node->as_whichplan = node->as_nplans - 1;
+        node->as_whichsyncplan = node->as_nplans - 1;
 
         /*
          * If we've yet to determine the valid subplans then do so now.  If
          * run-time pruning is disabled then the valid subplans will always be
          * set to all subplans.
          */
-        if (node->as_valid_subplans == NULL)
+        if (node->as_valid_syncsubplans == NULL)
         {
-            node->as_valid_subplans =
+            node->as_valid_syncsubplans =
                 ExecFindMatchingSubPlans(node->as_prune_state);
 
             /*
@@ -524,26 +756,26 @@ choose_next_subplan_for_leader(AppendState *node)
     }
 
     /* Loop until we find a subplan to execute. */
-    while (pstate->pa_finished[node->as_whichplan])
+    while (pstate->pa_finished[node->as_whichsyncplan])
     {
-        if (node->as_whichplan == 0)
+        if (node->as_whichsyncplan == 0)
         {
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-            node->as_whichplan = INVALID_SUBPLAN_INDEX;
+            node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
             LWLockRelease(&pstate->pa_lock);
             return false;
         }
 
         /*
-         * We needn't pay attention to as_valid_subplans here as all invalid
+         * We needn't pay attention to as_valid_syncsubplans here as all invalid
          * plans have been marked as finished.
          */
-        node->as_whichplan--;
+        node->as_whichsyncplan--;
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -571,23 +803,23 @@ choose_next_subplan_for_worker(AppendState *node)
     /* Backward scan is not supported by parallel-aware plans */
     Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
-    /* We should never be called when there are no subplans */
-    Assert(node->as_nplans > 0);
+    /* We should never be called when there are no sync subplans */
+    Assert(node->as_nplans > node->as_nasyncplans);
 
     LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
     /* Mark just-completed subplan as finished. */
-    if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     /*
      * If we've yet to determine the valid subplans then do so now.  If
      * run-time pruning is disabled then the valid subplans will always be set
      * to all subplans.
      */
-    else if (node->as_valid_subplans == NULL)
+    else if (node->as_valid_syncsubplans == NULL)
     {
-        node->as_valid_subplans =
+        node->as_valid_syncsubplans =
             ExecFindMatchingSubPlans(node->as_prune_state);
         mark_invalid_subplans_as_finished(node);
     }
@@ -600,30 +832,30 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Save the plan from which we are starting the search. */
-    node->as_whichplan = pstate->pa_next_plan;
+    node->as_whichsyncplan = pstate->pa_next_plan;
 
     /* Loop until we find a valid subplan to execute. */
     while (pstate->pa_finished[pstate->pa_next_plan])
     {
         int            nextplan;
 
-        nextplan = bms_next_member(node->as_valid_subplans,
+        nextplan = bms_next_member(node->as_valid_syncsubplans,
                                    pstate->pa_next_plan);
         if (nextplan >= 0)
         {
             /* Advance to the next valid plan. */
             pstate->pa_next_plan = nextplan;
         }
-        else if (node->as_whichplan > node->as_first_partial_plan)
+        else if (node->as_whichsyncplan > node->as_first_partial_plan)
         {
             /*
              * Try looping back to the first valid partial plan, if there is
              * one.  If there isn't, arrange to bail out below.
              */
-            nextplan = bms_next_member(node->as_valid_subplans,
+            nextplan = bms_next_member(node->as_valid_syncsubplans,
                                        node->as_first_partial_plan - 1);
             pstate->pa_next_plan =
-                nextplan < 0 ? node->as_whichplan : nextplan;
+                nextplan < 0 ? node->as_whichsyncplan : nextplan;
         }
         else
         {
@@ -631,10 +863,10 @@ choose_next_subplan_for_worker(AppendState *node)
              * At last plan, and either there are no partial plans or we've
              * tried them all.  Arrange to bail out.
              */
-            pstate->pa_next_plan = node->as_whichplan;
+            pstate->pa_next_plan = node->as_whichsyncplan;
         }
 
-        if (pstate->pa_next_plan == node->as_whichplan)
+        if (pstate->pa_next_plan == node->as_whichsyncplan)
         {
             /* We've tried everything! */
             pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -644,8 +876,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* Pick the plan we found, and advance pa_next_plan one more time. */
-    node->as_whichplan = pstate->pa_next_plan;
-    pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
+    node->as_whichsyncplan = pstate->pa_next_plan;
+    pstate->pa_next_plan = bms_next_member(node->as_valid_syncsubplans,
                                            pstate->pa_next_plan);
 
     /*
@@ -654,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node)
      */
     if (pstate->pa_next_plan < 0)
     {
-        int            nextplan = bms_next_member(node->as_valid_subplans,
+        int            nextplan = bms_next_member(node->as_valid_syncsubplans,
                                                node->as_first_partial_plan - 1);
 
         if (nextplan >= 0)
@@ -671,8 +903,8 @@ choose_next_subplan_for_worker(AppendState *node)
     }
 
     /* If non-partial, immediately mark as finished. */
-    if (node->as_whichplan < node->as_first_partial_plan)
-        node->as_pstate->pa_finished[node->as_whichplan] = true;
+    if (node->as_whichsyncplan < node->as_first_partial_plan)
+        node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
     LWLockRelease(&pstate->pa_lock);
 
@@ -699,13 +931,13 @@ mark_invalid_subplans_as_finished(AppendState *node)
     Assert(node->as_prune_state);
 
     /* Nothing to do if all plans are valid */
-    if (bms_num_members(node->as_valid_subplans) == node->as_nplans)
+    if (bms_num_members(node->as_valid_syncsubplans) == node->as_nplans)
         return;
 
     /* Mark all non-valid plans as finished */
     for (i = 0; i < node->as_nplans; i++)
     {
-        if (!bms_is_member(i, node->as_valid_subplans))
+        if (!bms_is_member(i, node->as_valid_syncsubplans))
             node->as_pstate->pa_finished[i] = true;
     }
 }
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 513471ab9b..3bf4aaa63d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -141,6 +141,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
     scanstate->ss.ps.plan = (Plan *) node;
     scanstate->ss.ps.state = estate;
     scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+    scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+    if ((eflags & EXEC_FLAG_ASYNC) != 0)
+        scanstate->fs_async = true;
 
     /*
      * Miscellaneous initialization
@@ -384,3 +388,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
     if (fdwroutine->ShutdownForeignScan)
         fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *        ExecAsyncForeignScanConfigureWait
+ *
+ *        In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                              void *caller_data, bool reinit)
+{
+    FdwRoutine *fdwroutine = node->fdwroutine;
+
+    Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+    return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+                                                 caller_data, reinit);
+}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 2719ea45a3..05b625783b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -895,6 +895,78 @@ bms_add_range(Bitmapset *a, int lower, int upper)
     return a;
 }
 
+/*
+ * bms_del_range
+ *        Delete members in the range of 'lower' to 'upper' from the set.
+ *
+ * Note this could also be done by calling bms_del_member in a loop, however,
+ * using this function will be faster when the range is large as we work at
+ * the bitmapword level rather than at bit level.
+ */
+Bitmapset *
+bms_del_range(Bitmapset *a, int lower, int upper)
+{
+    int            lwordnum,
+                lbitnum,
+                uwordnum,
+                ushiftbits,
+                wordnum;
+
+    if (lower < 0 || upper < 0)
+        elog(ERROR, "negative bitmapset member not allowed");
+    if (lower > upper)
+        elog(ERROR, "lower range must not be above upper range");
+    uwordnum = WORDNUM(upper);
+
+    if (a == NULL)
+    {
+        a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1));
+        a->nwords = uwordnum + 1;
+    }
+
+    /* ensure we have enough words to store the upper bit */
+    else if (uwordnum >= a->nwords)
+    {
+        int            oldnwords = a->nwords;
+        int            i;
+
+        a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1));
+        a->nwords = uwordnum + 1;
+        /* zero out the enlarged portion */
+        for (i = oldnwords; i < a->nwords; i++)
+            a->words[i] = 0;
+    }
+
+    wordnum = lwordnum = WORDNUM(lower);
+
+    lbitnum = BITNUM(lower);
+    ushiftbits = BITNUM(upper) + 1;
+
+    /*
+     * Special case when lwordnum is the same as uwordnum we must perform the
+     * upper and lower masking on the word.
+     */
+    if (lwordnum == uwordnum)
+    {
+        a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1)
+                                | (~(bitmapword) 0) << ushiftbits);
+    }
+    else
+    {
+        /* turn off lbitnum and all bits left of it */
+        a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1);
+
+        /* turn off all bits for any intermediate words */
+        while (wordnum < uwordnum)
+            a->words[wordnum++] = (bitmapword) 0;
+
+        /* turn off upper's bit and all bits right of it. */
+        a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits;
+    }
+
+    return a;
+}
+
 /*
  * bms_int_members - like bms_intersect, but left input is recycled
  */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 0409a40b82..4eff3712b7 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -121,6 +121,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
     COPY_SCALAR_FIELD(plan_width);
     COPY_SCALAR_FIELD(parallel_aware);
     COPY_SCALAR_FIELD(parallel_safe);
+    COPY_SCALAR_FIELD(async_capable);
     COPY_SCALAR_FIELD(plan_node_id);
     COPY_NODE_FIELD(targetlist);
     COPY_NODE_FIELD(qual);
@@ -246,6 +247,8 @@ _copyAppend(const Append *from)
     COPY_NODE_FIELD(appendplans);
     COPY_SCALAR_FIELD(first_partial_plan);
     COPY_NODE_FIELD(part_prune_info);
+    COPY_SCALAR_FIELD(nasyncplans);
+    COPY_SCALAR_FIELD(referent);
 
     return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f0386480ab..2b1b0e9141 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -334,6 +334,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
     WRITE_INT_FIELD(plan_width);
     WRITE_BOOL_FIELD(parallel_aware);
     WRITE_BOOL_FIELD(parallel_safe);
+    WRITE_BOOL_FIELD(async_capable);
     WRITE_INT_FIELD(plan_node_id);
     WRITE_NODE_FIELD(targetlist);
     WRITE_NODE_FIELD(qual);
@@ -436,6 +437,8 @@ _outAppend(StringInfo str, const Append *node)
     WRITE_NODE_FIELD(appendplans);
     WRITE_INT_FIELD(first_partial_plan);
     WRITE_NODE_FIELD(part_prune_info);
+    WRITE_INT_FIELD(nasyncplans);
+    WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab719..63af7c02d8 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1572,6 +1572,7 @@ ReadCommonPlan(Plan *local_node)
     READ_INT_FIELD(plan_width);
     READ_BOOL_FIELD(parallel_aware);
     READ_BOOL_FIELD(parallel_safe);
+    READ_BOOL_FIELD(async_capable);
     READ_INT_FIELD(plan_node_id);
     READ_NODE_FIELD(targetlist);
     READ_NODE_FIELD(qual);
@@ -1672,6 +1673,8 @@ _readAppend(void)
     READ_NODE_FIELD(appendplans);
     READ_INT_FIELD(first_partial_plan);
     READ_NODE_FIELD(part_prune_info);
+    READ_INT_FIELD(nasyncplans);
+    READ_INT_FIELD(referent);
 
     READ_DONE();
 }
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b399592ff8..17e9a7a897 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3973,6 +3973,30 @@ generate_partitionwise_join_paths(PlannerInfo *root, RelOptInfo *rel)
     list_free(live_children);
 }
 
+/*
+ * is_projection_capable_path
+ *        Check whether a given Path node is async-capable.
+ */
+bool
+is_async_capable_path(Path *path)
+{
+    switch (nodeTag(path))
+    {
+        case T_ForeignPath:
+            {
+                FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+                Assert(fdwroutine != NULL);
+                if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+                    fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+                    return true;
+            }
+        default:
+            break;
+    }
+    return false;
+}
+
 
 /*****************************************************************************
  *            DEBUG SUPPORT
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index cd3716d494..143e00b13e 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2048,22 +2048,59 @@ cost_append(AppendPath *apath)
 
         if (pathkeys == NIL)
         {
-            Path       *subpath = (Path *) linitial(apath->subpaths);
-
-            /*
-             * For an unordered, non-parallel-aware Append we take the startup
-             * cost as the startup cost of the first subpath.
-             */
-            apath->path.startup_cost = subpath->startup_cost;
+            Cost        first_nonasync_startup_cost = -1.0;
+            Cost        async_min_startup_cost = -1;
+            Cost        async_max_cost = 0.0;
 
             /* Compute rows and costs as sums of subplan rows and costs. */
             foreach(l, apath->subpaths)
             {
                 Path       *subpath = (Path *) lfirst(l);
 
+                /*
+                 * For an unordered, non-parallel-aware Append we take the
+                 * startup cost as the startup cost of the first
+                 * nonasync-capable subpath or the minimum startup cost of
+                 * async-capable subpaths.
+                 */
+                if (!is_async_capable_path(subpath))
+                {
+                    if (first_nonasync_startup_cost < 0.0)
+                        first_nonasync_startup_cost = subpath->startup_cost;
+
+                    apath->path.total_cost += subpath->total_cost;
+                }
+                else
+                {
+                    if (async_min_startup_cost < 0.0 ||
+                        async_min_startup_cost > subpath->startup_cost)
+                        async_min_startup_cost = subpath->startup_cost;
+
+                    /*
+                     * It's not obvious how to determine the total cost of
+                     * async subnodes. Although it is not always true, we
+                     * assume it is the maximum cost among all async subnodes.
+                     */
+                    if (async_max_cost < subpath->total_cost)
+                        async_max_cost = subpath->total_cost;
+                }
+
                 apath->path.rows += subpath->rows;
-                apath->path.total_cost += subpath->total_cost;
             }
+
+            /*
+             * If there's an sync subnodes, the startup cost is the startup
+             * cost of the first sync subnode. Otherwise it's the minimal
+             * startup cost of async subnodes.
+             */
+            if (first_nonasync_startup_cost >= 0.0)
+                apath->path.startup_cost = first_nonasync_startup_cost;
+            else
+                apath->path.startup_cost = async_min_startup_cost;
+
+            /* Use async maximum cost if it exceeds the sync total cost */
+            if (async_max_cost > apath->path.total_cost)
+                apath->path.total_cost = async_max_cost;
         }
         else
         {
@@ -2084,6 +2121,8 @@ cost_append(AppendPath *apath)
              * This case is also different from the above in that we have to
              * account for possibly injecting sorts into subpaths that aren't
              * natively ordered.
+             *
+             * Note: An ordered append won't be run asynchronously.
              */
             foreach(l, apath->subpaths)
             {
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 3d7a4e373f..3ae46ed6f1 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1082,6 +1082,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     bool        tlist_was_changed = false;
     List       *pathkeys = best_path->path.pathkeys;
     List       *subplans = NIL;
+    List       *asyncplans = NIL;
+    List       *syncplans = NIL;
+    List       *asyncpaths = NIL;
+    List       *syncpaths = NIL;
+    List       *newsubpaths = NIL;
     ListCell   *subpaths;
     RelOptInfo *rel = best_path->path.parent;
     PartitionPruneInfo *partpruneinfo = NULL;
@@ -1090,6 +1095,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     Oid           *nodeSortOperators = NULL;
     Oid           *nodeCollations = NULL;
     bool       *nodeNullsFirst = NULL;
+    int            nasyncplans = 0;
+    bool        first = true;
+    bool        referent_is_sync = true;
 
     /*
      * The subpaths list could be empty, if every child was proven empty by
@@ -1219,9 +1227,40 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
             }
         }
 
-        subplans = lappend(subplans, subplan);
+        /*
+         * Classify as async-capable or not. If we have decided to run the
+         * children in parallel, we cannot any one of them run asynchronously.
+         * Planner thinks that all subnodes are executed in order if this
+         * append is orderd. No subpaths cannot be run asynchronously in that
+         * case.
+         */
+        if (pathkeys == NIL &&
+            !best_path->path.parallel_safe && is_async_capable_path(subpath))
+        {
+            subplan->async_capable = true;
+            asyncplans = lappend(asyncplans, subplan);
+            asyncpaths = lappend(asyncpaths, subpath);
+            ++nasyncplans;
+            if (first)
+                referent_is_sync = false;
+        }
+        else
+        {
+            syncplans = lappend(syncplans, subplan);
+            syncpaths = lappend(syncpaths, subpath);
+        }
+
+        first = false;
     }
 
+    /*
+     * subplan contains asyncplans in the first half, if any, and sync plans in
+     * another half, if any.  We need that the same for subpaths to make
+     * partition pruning information in sync with subplans.
+     */
+    subplans = list_concat(asyncplans, syncplans);
+    newsubpaths = list_concat(asyncpaths, syncpaths);
+
     /*
      * If any quals exist, they may be useful to perform further partition
      * pruning during execution.  Gather information needed by the executor to
@@ -1249,7 +1288,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
         if (prunequal != NIL)
             partpruneinfo =
                 make_partition_pruneinfo(root, rel,
-                                         best_path->subpaths,
+                                         newsubpaths,
                                          best_path->partitioned_rels,
                                          prunequal);
     }
@@ -1257,6 +1296,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
     plan->appendplans = subplans;
     plan->first_partial_plan = best_path->first_partial_path;
     plan->part_prune_info = partpruneinfo;
+    plan->nasyncplans = nasyncplans;
+    plan->referent = referent_is_sync ? nasyncplans : 0;
 
     copy_generic_path_info(&plan->plan, (Path *) best_path);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 30020f8cda..faed3f2442 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3878,6 +3878,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_XACT_GROUP_UPDATE:
             event_name = "XactGroupUpdate";
             break;
+        case WAIT_EVENT_ASYNC_WAIT:
+            event_name = "AsyncExecWait";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 62023c20b2..07aeb43a7f 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4574,10 +4574,14 @@ set_deparse_plan(deparse_namespace *dpns, Plan *plan)
      * tlists according to one of the children, and the first one is the most
      * natural choice.  Likewise special-case ModifyTable to pretend that the
      * first child plan is the OUTER referent; this is to support RETURNING
-     * lists containing references to non-target relations.
+     * lists containing references to non-target relations. For Append, use the
+     * explicitly specified referent.
      */
     if (IsA(plan, Append))
-        dpns->outer_plan = linitial(((Append *) plan)->appendplans);
+    {
+        Append *app = (Append *) plan;
+        dpns->outer_plan = list_nth(app->appendplans, app->referent);
+    }
     else if (IsA(plan, MergeAppend))
         dpns->outer_plan = linitial(((MergeAppend *) plan)->mergeplans);
     else if (IsA(plan, ModifyTable))
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 237ca9fa30..27742a1641 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -1416,7 +1416,7 @@ void
 ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
 {
     /*
-     * XXXX: There's no property to show as an identier of a wait event set,
+     * XXXX: There's no property to show as an identifier of a wait event set,
      * use its pointer instead.
      */
     if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
@@ -1431,7 +1431,7 @@ static void
 PrintWESLeakWarning(WaitEventSet *events)
 {
     /*
-     * XXXX: There's no property to show as an identier of a wait event set,
+     * XXXX: There's no property to show as an identifier of a wait event set,
      * use its pointer instead.
      */
     elog(WARNING, "wait event set leak: %p still referenced",
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..3b6bf4a516
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,22 @@
+/*--------------------------------------------------------------------
+ * execAsync.c
+ *        Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *        src/backend/executor/execAsync.c
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+                                   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+                                     long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 415e117407..9cf2c1f676 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -59,6 +59,7 @@
 #define EXEC_FLAG_MARK            0x0008    /* need mark/restore */
 #define EXEC_FLAG_SKIP_TRIGGERS 0x0010    /* skip AfterTrigger calls */
 #define EXEC_FLAG_WITH_NO_DATA    0x0020    /* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC            0x0040    /* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 326d713ebf..71a233b41f 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
                                             ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+                                          WaitEventSet *wes,
+                                          void *caller_data, bool reinit);
 
 #endif                            /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 95556dfb15..853ba2b5ad 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -169,6 +169,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
                                                             List *fdw_private,
                                                             RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+                                                    WaitEventSet *wes,
+                                                    void *caller_data,
+                                                    bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -190,6 +195,7 @@ typedef struct FdwRoutine
     GetForeignPlan_function GetForeignPlan;
     BeginForeignScan_function BeginForeignScan;
     IterateForeignScan_function IterateForeignScan;
+    IterateForeignScan_function IterateForeignScanAsync;
     ReScanForeignScan_function ReScanForeignScan;
     EndForeignScan_function EndForeignScan;
 
@@ -242,6 +248,11 @@ typedef struct FdwRoutine
     InitializeDSMForeignScan_function InitializeDSMForeignScan;
     ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
     InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+    /* Support functions for asynchronous execution */
+    IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+    ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
     ShutdownForeignScan_function ShutdownForeignScan;
 
     /* Support functions for path reparameterization. */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index d113c271ee..177e6218cb 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -107,6 +107,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
+extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
 
 /* support for iterating through the integer elements of a set: */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ef448d67c7..dce7fb0e07 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -925,6 +925,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+    AS_AVAILABLE,
+    AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
     NodeTag        type;
@@ -1013,6 +1019,11 @@ typedef struct PlanState
     bool        outeropsset;
     bool        inneropsset;
     bool        resultopsset;
+
+    /* Async subnode execution stuff */
+    AsyncState    asyncstate;
+
+    int32        padding;            /* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1208,14 +1219,21 @@ struct AppendState
     PlanState    ps;                /* its first field is NodeTag */
     PlanState **appendplans;    /* array of PlanStates for my inputs */
     int            as_nplans;
-    int            as_whichplan;
+    int            as_whichsyncplan; /* which sync plan is being executed  */
     int            as_first_partial_plan;    /* Index of 'appendplans' containing
                                          * the first partial plan */
+    int            as_nasyncplans;    /* # of async-capable children */
     ParallelAppendState *as_pstate; /* parallel coordination info */
     Size        pstate_len;        /* size of parallel coordination info */
     struct PartitionPruneState *as_prune_state;
-    Bitmapset  *as_valid_subplans;
+    Bitmapset  *as_valid_syncsubplans;
     bool        (*choose_next_subplan) (AppendState *);
+    bool        as_syncdone;    /* all synchronous plans done? */
+    Bitmapset  *as_needrequest;    /* async plans needing a new request */
+    Bitmapset  *as_pending_async;    /* pending async plans */
+    TupleTableSlot **as_asyncresult;    /* results of each async plan */
+    int            as_nasyncresult;    /* # of valid entries in as_asyncresult */
+    bool        as_exec_prune;    /* runtime pruning needed for async exec? */
 };
 
 /* ----------------
@@ -1783,6 +1801,7 @@ typedef struct ForeignScanState
     Size        pscan_len;        /* size of parallel coordination information */
     /* use struct pointer to avoid including fdwapi.h here */
     struct FdwRoutine *fdwroutine;
+    bool        fs_async;
     void       *fdw_state;        /* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 83e01074ed..abad89b327 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -135,6 +135,11 @@ typedef struct Plan
     bool        parallel_aware; /* engage parallel-aware logic? */
     bool        parallel_safe;    /* OK to use as part of parallel plan? */
 
+    /*
+     * information needed for asynchronous execution
+     */
+    bool        async_capable;  /* engage asynchronous execution logic? */
+
     /*
      * Common structural data for all Plan types.
      */
@@ -262,6 +267,10 @@ typedef struct Append
 
     /* Info for run-time subplan pruning; NULL if we're not doing that */
     struct PartitionPruneInfo *part_prune_info;
+
+    /* Async child node execution stuff */
+    int            nasyncplans;    /* # async subplans, always at start of list */
+    int            referent;         /* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 10b6e81079..53876b2d8b 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -241,4 +241,6 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
 extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
                                     List *live_childrels);
 
+extern bool is_async_capable_path(Path *path);
+
 #endif                            /* PATHS_H */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4..d673f9da6b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -887,7 +887,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
     WAIT_EVENT_SYNC_REP,
-    WAIT_EVENT_XACT_GROUP_UPDATE
+    WAIT_EVENT_XACT_GROUP_UPDATE,
+    WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.18.4

From 4df7f9b34ad8d9fd9b415459e2673ebe27f72343 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH v7 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c             |  28 +
 .../postgres_fdw/expected/postgres_fdw.out    | 272 ++++----
 contrib/postgres_fdw/postgres_fdw.c           | 601 +++++++++++++++---
 contrib/postgres_fdw/postgres_fdw.h           |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql     |  20 +-
 5 files changed, 710 insertions(+), 213 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 08daf26fdf..be5948f613 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -59,6 +59,7 @@ typedef struct ConnCacheEntry
     bool        invalidated;    /* true if reconnect is pending */
     uint32        server_hashvalue;    /* hash value of foreign server OID */
     uint32        mapping_hashvalue;    /* hash value of user mapping OID */
+    void        *storage;        /* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -203,6 +204,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
         elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
              entry->conn, server->servername, user->umid, user->userid);
+        entry->storage = NULL;
     }
 
     /*
@@ -216,6 +218,32 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
     return entry->conn;
 }
 
+/*
+ * Returns the connection specific storage for this user. Allocate with
+ * initsize if not exists.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+    bool        found;
+    ConnCacheEntry *entry;
+    ConnCacheKey key;
+
+    /* Find storage using the same key with GetConnection */
+    key = user->umid;
+    entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+    Assert(found);
+
+    /* Create one if not yet. */
+    if (entry->storage == NULL)
+    {
+        entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+        memset(entry->storage, 0, initsize);
+    }
+
+    return entry->storage;
+}
+
 /*
  * Connect to remote server using specified server and user mapping properties.
  */
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 10e23d02ed..0634ab9f6a 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6986,7 +6986,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -7014,7 +7014,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -7042,7 +7042,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -7070,7 +7070,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -7140,35 +7140,41 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar bar_1
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+                     Sort Key: bar_1.f1
+                     ->  Seq Scan on public.bar bar_1
+                           Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
                ->  Foreign Scan on public.bar2 bar_2
                      Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -7178,35 +7184,41 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar bar_1
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
+                     Sort Key: bar_1.f1
+                     ->  Seq Scan on public.bar bar_1
+                           Output: bar_1.f1, bar_1.f2, bar_1.ctid, bar_1.*, bar_1.tableoid
                ->  Foreign Scan on public.bar2 bar_2
                      Output: bar_2.f1, bar_2.f2, bar_2.ctid, bar_2.*, bar_2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.f1, foo.*, foo.tableoid
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -7236,11 +7248,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
    ->  Hash Join
          Output: bar_1.f1, (bar_1.f2 + 100), bar_1.f3, bar_1.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -7254,12 +7267,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.f1, foo.*, foo.tableoid
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo foo_1
-                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
-                           ->  Foreign Scan on public.foo2 foo_2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2 foo_2
                                  Output: foo_2.ctid, foo_2.f1, foo_2.*, foo_2.tableoid
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+                           ->  Seq Scan on public.foo foo_1
+                                 Output: foo_1.ctid, foo_1.f1, foo_1.*, foo_1.tableoid
+(41 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
 select tableoid::regclass, * from bar order by 1,2;
@@ -7289,16 +7303,17 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
+               Async subplans: 2 
+               ->  Async Foreign Scan on public.foo2 foo_1
+                     Output: ROW(foo_1.f1), foo_1.f1
+                     Remote SQL: SELECT f1 FROM public.loct1
+               ->  Async Foreign Scan on public.foo2 foo_3
+                     Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+                     Remote SQL: SELECT f1 FROM public.loct1
                ->  Seq Scan on public.foo
                      Output: ROW(foo.f1), foo.f1
-               ->  Foreign Scan on public.foo2 foo_1
-                     Output: ROW(foo_1.f1), foo_1.f1
-                     Remote SQL: SELECT f1 FROM public.loct1
                ->  Seq Scan on public.foo foo_2
                      Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
-               ->  Foreign Scan on public.foo2 foo_3
-                     Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
-                     Remote SQL: SELECT f1 FROM public.loct1
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -7316,17 +7331,18 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on public.foo2 foo_1
+                           Output: ROW(foo_1.f1), foo_1.f1
+                           Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Async Foreign Scan on public.foo2 foo_3
+                           Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
+                           Remote SQL: SELECT f1 FROM public.loct1
                      ->  Seq Scan on public.foo
                            Output: ROW(foo.f1), foo.f1
-                     ->  Foreign Scan on public.foo2 foo_1
-                           Output: ROW(foo_1.f1), foo_1.f1
-                           Remote SQL: SELECT f1 FROM public.loct1
                      ->  Seq Scan on public.foo foo_2
                            Output: ROW((foo_2.f1 + 3)), (foo_2.f1 + 3)
-                     ->  Foreign Scan on public.foo2 foo_3
-                           Output: ROW((foo_3.f1 + 3)), (foo_3.f1 + 3)
-                           Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+(47 rows)
 
 update bar set f2 = f2 + 100
 from
@@ -7476,27 +7492,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2 bar_1
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2 bar_1
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2 bar_1
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2 bar_1
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
@@ -8571,11 +8593,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
  Sort
    Sort Key: t1.a, t3.c
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: ((ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)) INNER JOIN (ftprt1_p1 t3_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: ((ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)) INNER JOIN (ftprt1_p2 t3_2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE
t1.a% 25 =0 ORDER BY 1,2,3;
 
   a  |  b  |  c   
@@ -8610,20 +8633,22 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
 -- with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2
t2WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
 
-                       QUERY PLAN                       
---------------------------------------------------------
+                          QUERY PLAN                          
+--------------------------------------------------------------
  Sort
    Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
    ->  Hash Full Join
          Hash Cond: (t1.a = t2.b)
          ->  Append
-               ->  Foreign Scan on ftprt1_p1 t1_1
-               ->  Foreign Scan on ftprt1_p2 t1_2
+               Async subplans: 2 
+               ->  Async Foreign Scan on ftprt1_p1 t1_1
+               ->  Async Foreign Scan on ftprt1_p2 t1_2
          ->  Hash
                ->  Append
-                     ->  Foreign Scan on ftprt2_p1 t2_1
-                     ->  Foreign Scan on ftprt2_p2 t2_2
-(11 rows)
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on ftprt2_p1 t2_1
+                     ->  Async Foreign Scan on ftprt2_p2 t2_2
+(13 rows)
 
 SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1 FULL JOIN (SELECT t2 wr, b FROM fprt2
t2WHERE t2.b % 25 = 0) t2 ON (t1.a = t2.b) ORDER BY 1,2;
 
        wr       |       wr       
@@ -8652,11 +8677,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
  Sort
    Sort Key: t1.a, t1.b
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p2 t2_2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE
t1.a%25= 0 ORDER BY 1,2;
 
   a  |  b  
@@ -8709,21 +8735,23 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
 -- test FOR UPDATE; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF
t1;
-                          QUERY PLAN                          
---------------------------------------------------------------
+                             QUERY PLAN                             
+--------------------------------------------------------------------
  LockRows
    ->  Sort
          Sort Key: t1.a
          ->  Hash Join
                Hash Cond: (t2.b = t1.a)
                ->  Append
-                     ->  Foreign Scan on ftprt2_p1 t2_1
-                     ->  Foreign Scan on ftprt2_p2 t2_2
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on ftprt2_p1 t2_1
+                     ->  Async Foreign Scan on ftprt2_p2 t2_2
                ->  Hash
                      ->  Append
-                           ->  Foreign Scan on ftprt1_p1 t1_1
-                           ->  Foreign Scan on ftprt1_p2 t1_2
-(12 rows)
+                           Async subplans: 2 
+                           ->  Async Foreign Scan on ftprt1_p1 t1_1
+                           ->  Async Foreign Scan on ftprt1_p2 t1_2
+(14 rows)
 
 SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF
t1;
   a  |  b  
@@ -8758,18 +8786,19 @@ ANALYZE fpagg_tab_p3;
 SET enable_partitionwise_aggregate TO false;
 EXPLAIN (COSTS OFF)
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
-                        QUERY PLAN                         
------------------------------------------------------------
+                           QUERY PLAN                            
+-----------------------------------------------------------------
  Sort
    Sort Key: pagg_tab.a
    ->  HashAggregate
          Group Key: pagg_tab.a
          Filter: (avg(pagg_tab.b) < '22'::numeric)
          ->  Append
-               ->  Foreign Scan on fpagg_tab_p1 pagg_tab_1
-               ->  Foreign Scan on fpagg_tab_p2 pagg_tab_2
-               ->  Foreign Scan on fpagg_tab_p3 pagg_tab_3
-(9 rows)
+               Async subplans: 3 
+               ->  Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+               ->  Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+               ->  Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
 
 -- Plan with partitionwise aggregates is enabled
 SET enable_partitionwise_aggregate TO true;
@@ -8780,13 +8809,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
  Sort
    Sort Key: pagg_tab.a
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 3 
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p1 pagg_tab)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p2 pagg_tab_1)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (fpagg_tab_p3 pagg_tab_2)
-(9 rows)
+(10 rows)
 
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | sum  | min | count 
@@ -8808,29 +8838,22 @@ SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  Sort
    Output: t1.a, (count(((t1.*)::pagg_tab)))
    Sort Key: t1.a
-   ->  Append
-         ->  HashAggregate
-               Output: t1.a, count(((t1.*)::pagg_tab))
-               Group Key: t1.a
-               Filter: (avg(t1.b) < '22'::numeric)
-               ->  Foreign Scan on public.fpagg_tab_p1 t1
-                     Output: t1.a, t1.*, t1.b
-                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
-         ->  HashAggregate
-               Output: t1_1.a, count(((t1_1.*)::pagg_tab))
-               Group Key: t1_1.a
-               Filter: (avg(t1_1.b) < '22'::numeric)
-               ->  Foreign Scan on public.fpagg_tab_p2 t1_1
+   ->  HashAggregate
+         Output: t1.a, count(((t1.*)::pagg_tab))
+         Group Key: t1.a
+         Filter: (avg(t1.b) < '22'::numeric)
+         ->  Append
+               Async subplans: 3 
+               ->  Async Foreign Scan on public.fpagg_tab_p1 t1_1
                      Output: t1_1.a, t1_1.*, t1_1.b
-                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
-         ->  HashAggregate
-               Output: t1_2.a, count(((t1_2.*)::pagg_tab))
-               Group Key: t1_2.a
-               Filter: (avg(t1_2.b) < '22'::numeric)
-               ->  Foreign Scan on public.fpagg_tab_p3 t1_2
+                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
+               ->  Async Foreign Scan on public.fpagg_tab_p2 t1_2
                      Output: t1_2.a, t1_2.*, t1_2.b
+                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
+               ->  Async Foreign Scan on public.fpagg_tab_p3 t1_3
+                     Output: t1_3.a, t1_3.*, t1_3.b
                      Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
-(25 rows)
+(18 rows)
 
 SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | count 
@@ -8850,20 +8873,15 @@ SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700
 -----------------------------------------------------------------
  Sort
    Sort Key: pagg_tab.b
-   ->  Finalize HashAggregate
+   ->  HashAggregate
          Group Key: pagg_tab.b
          Filter: (sum(pagg_tab.a) < 700)
          ->  Append
-               ->  Partial HashAggregate
-                     Group Key: pagg_tab.b
-                     ->  Foreign Scan on fpagg_tab_p1 pagg_tab
-               ->  Partial HashAggregate
-                     Group Key: pagg_tab_1.b
-                     ->  Foreign Scan on fpagg_tab_p2 pagg_tab_1
-               ->  Partial HashAggregate
-                     Group Key: pagg_tab_2.b
-                     ->  Foreign Scan on fpagg_tab_p3 pagg_tab_2
-(15 rows)
+               Async subplans: 3 
+               ->  Async Foreign Scan on fpagg_tab_p1 pagg_tab_1
+               ->  Async Foreign Scan on fpagg_tab_p2 pagg_tab_2
+               ->  Async Foreign Scan on fpagg_tab_p3 pagg_tab_3
+(10 rows)
 
 -- ===================================================================
 -- access rights and superuser
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index a31abce7c9..14824368cc 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -21,6 +21,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -35,6 +37,7 @@
 #include "optimizer/restrictinfo.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "postgres_fdw.h"
 #include "utils/builtins.h"
 #include "utils/float.h"
@@ -56,6 +59,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -122,11 +128,29 @@ enum FdwDirectModifyPrivateIndex
     FdwDirectModifyPrivateSetProcessed
 };
 
+/*
+ * Connection common state - shared among all PgFdwState instances using the
+ * same connection.
+ */
+typedef struct PgFdwConnCommonState
+{
+    ForeignScanState   *leader;        /* leader node of this connection */
+    bool                busy;        /* true if this connection is busy */
+} PgFdwConnCommonState;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+    PGconn       *conn;                    /* connection for the scan */
+    PgFdwConnCommonState *commonstate;    /* connection common state */
+} PgFdwState;
+
 /*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table. NULL
                                  * for a foreign join scan. */
     TupleDesc    tupdesc;        /* tuple descriptor of scan */
@@ -137,7 +161,6 @@ typedef struct PgFdwScanState
     List       *retrieved_attrs;    /* list of retrieved attribute numbers */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     unsigned int cursor_number; /* quasi-unique ID for my cursor */
     bool        cursor_exists;    /* have we created the cursor? */
     int            numParams;        /* number of parameters passed to query */
@@ -153,6 +176,12 @@ typedef struct PgFdwScanState
     /* batch-level state, for optimizing rewinds and avoiding useless fetch */
     int            fetch_ct_2;        /* Min(# of fetches done, 2) */
     bool        eof_reached;    /* true if last fetch reached EOF */
+    bool        async;            /* true if run asynchronously */
+    bool        queued;            /* true if this node is in waiter queue */
+    ForeignScanState *waiter;    /* Next node to run a query among nodes
+                                 * sharing the same connection */
+    ForeignScanState *last_waiter;    /* last element in waiter queue.
+                                     * valid only on the leader node */
 
     /* working memory contexts */
     MemoryContext batch_cxt;    /* context holding current batch of tuples */
@@ -166,11 +195,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
     /* for remote query execution */
-    PGconn       *conn;            /* connection for the scan */
     char       *p_name;            /* name of prepared statement, if created */
 
     /* extracted fdw_private data */
@@ -197,6 +226,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+    PgFdwState    s;                /* common structure */
     Relation    rel;            /* relcache entry for the foreign table */
     AttInMetadata *attinmeta;    /* attribute datatype conversion metadata */
 
@@ -326,6 +356,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
                                             RangeTblEntry *target_rte,
                                             Relation target_relation);
@@ -391,6 +422,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
                                          RelOptInfo *input_rel,
                                          RelOptInfo *output_rel,
                                          void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+                                              WaitEventSet *wes,
+                                              void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -419,7 +454,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
                                       EquivalenceClass *ec, EquivalenceMember *em,
                                       void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static PgFdwModifyState *create_foreign_modify(EState *estate,
                                                RangeTblEntry *rte,
@@ -522,6 +559,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     routine->IterateForeignScan = postgresIterateForeignScan;
     routine->ReScanForeignScan = postgresReScanForeignScan;
     routine->EndForeignScan = postgresEndForeignScan;
+    routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
     /* Functions for updating foreign tables */
     routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -558,6 +596,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     /* Support functions for upper relation push-down */
     routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+    /* Support functions for async execution */
+    routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+    routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
     PG_RETURN_POINTER(routine);
 }
 
@@ -1433,12 +1475,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    fsstate->conn = GetConnection(user, false);
+    fsstate->s.conn = GetConnection(user, false);
+    fsstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
+    fsstate->s.commonstate->leader = NULL;
+    fsstate->s.commonstate->busy = false;
+    fsstate->waiter = NULL;
+    fsstate->last_waiter = node;
 
     /* Assign a unique ID for my cursor */
-    fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+    fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
     fsstate->cursor_exists = false;
 
+    /* Initialize async execution status */
+    fsstate->async = false;
+    fsstate->queued = false;
+
     /* Get private info created by planner functions. */
     fsstate->query = strVal(list_nth(fsplan->fdw_private,
                                      FdwScanPrivateSelectSql));
@@ -1486,40 +1538,241 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
                              &fsstate->param_values);
 }
 
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Enqueue node if it isn't in the queue. Immediately send request it if the
+ * underlying connection is not busy.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+    PgFdwScanState   *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.commonstate->leader;
+
+    /*
+     * Do nothing if the node is already in the queue or already eof'ed.
+     * Note: leader node is not marked as queued.
+     */
+    if (leader == node || fsstate->queued || fsstate->eof_reached)
+        return;
+
+    if (leader == NULL)
+    {
+        /* no leader means not busy, send request immediately */
+        request_more_data(node);
+    }
+    else
+    {
+        /* the connection is busy, queue the node */
+        PgFdwScanState   *leader_state = GetPgFdwScanState(leader);
+        PgFdwScanState   *last_waiter_state
+            = GetPgFdwScanState(leader_state->last_waiter);
+
+        last_waiter_state->waiter = node;
+        leader_state->last_waiter = node;
+        fsstate->queued = true;
+    }
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Make the first waiter be the next leader
+ * Returns the new leader or NULL if there's no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+    PgFdwScanState *leader_state = GetPgFdwScanState(node);
+    ForeignScanState *next_leader = leader_state->waiter;
+
+    Assert(leader_state->s.commonstate->leader = node);
+
+    if (next_leader)
+    {
+        /* the first waiter becomes the next leader */
+        PgFdwScanState *next_leader_state = GetPgFdwScanState(next_leader);
+        next_leader_state->last_waiter = leader_state->last_waiter;
+        next_leader_state->queued = false;
+    }
+
+    leader_state->waiter = NULL;
+    leader_state->s.commonstate->leader = next_leader;
+
+    return next_leader;
+}
+
+/*
+ * Remove the node from waiter queue.
+ *
+ * Remaining results are cleared if the node is a busy leader.
+ * This intended to be used during node shutdown.
+ */
+static inline void
+remove_async_node(ForeignScanState *node)
+{
+    PgFdwScanState        *fsstate = GetPgFdwScanState(node);
+    ForeignScanState    *leader = fsstate->s.commonstate->leader;
+    PgFdwScanState        *leader_state;
+    ForeignScanState    *prev;
+    PgFdwScanState        *prev_state;
+    ForeignScanState    *cur;
+
+    /* no need to remove me */
+    if (!leader || !fsstate->queued)
+        return;
+
+    leader_state = GetPgFdwScanState(leader);
+
+    if (leader == node)
+    {
+        if (leader_state->s.commonstate->busy)
+        {
+            /*
+             * this node is waiting for result, absorb the result first so
+             * that the following commands can be sent on the connection.
+             */
+            PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+            PGconn *conn = leader_state->s.conn;
+
+            while(PQisBusy(conn))
+                PQclear(PQgetResult(conn));
+
+            leader_state->s.commonstate->busy = false;
+        }
+
+        move_to_next_waiter(node);
+
+        return;
+    }
+
+    /*
+     * Just remove the node from the queue
+     *
+     * Nodes don't have a link to the previous node but anyway this function is
+     * called on the shutdown path, so we don't bother seeking for faster way
+     * to do this.
+     */
+    prev = leader;
+    prev_state = leader_state;
+    cur =  GetPgFdwScanState(prev)->waiter;
+    while (cur)
+    {
+        PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+        if (cur == node)
+        {
+            prev_state->waiter = curstate->waiter;
+
+            /* relink to the previous node if the last node was removed */
+            if (leader_state->last_waiter == cur)
+                leader_state->last_waiter = prev;
+
+            fsstate->queued = false;
+
+            return;
+        }
+        prev = cur;
+        prev_state = curstate;
+        cur = curstate->waiter;
+    }
+}
+
 /*
  * postgresIterateForeignScan
- *        Retrieve next row from the result set, or clear tuple slot to indicate
- *        EOF.
+ *        Retrieve next row from the result set.
+ *
+ *        For synchronous nodes, returns clear tuple slot means EOF.
+ *
+ *        For asynchronous nodes, if clear tuple slot is returned, the caller
+ *        needs to check async state to tell if all tuples received
+ *        (AS_AVAILABLE) or waiting for the next data to come (AS_WAITING).
  */
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
-    /*
-     * If this is the first call after Begin or ReScan, we need to create the
-     * cursor on the remote side.
-     */
-    if (!fsstate->cursor_exists)
-        create_cursor(node);
-
-    /*
-     * Get some more tuples, if we've run out.
-     */
+    if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+    {
+        /* we've run out, get some more tuples */
+        if (!node->fs_async)
+        {
+            /*
+             * finish the running query before sending the next command for
+             * this node
+             */
+            if (!fsstate->s.commonstate->busy)
+                vacate_connection((PgFdwState *)fsstate, false);
+
+            request_more_data(node);
+
+            /* Fetch the result immediately. */
+            fetch_received_data(node);
+        }
+        else if (!fsstate->s.commonstate->busy)
+        {
+            /* If the connection is not busy, just send the request. */
+            request_more_data(node);
+        }
+        else
+        {
+            /* The connection is busy, queue the request */
+            bool available = true;
+            ForeignScanState *leader = fsstate->s.commonstate->leader;
+            PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+
+            /* queue the requested node */
+            add_async_waiter(node);
+
+            /*
+             * The request for the next node cannot be sent before the leader
+             * responds. Finish the current leader if possible.
+             */
+            if (PQisBusy(leader_state->s.conn))
+            {
+                int rc = WaitLatchOrSocket(NULL,
+                                           WL_SOCKET_READABLE | WL_TIMEOUT |
+                                           WL_EXIT_ON_PM_DEATH,
+                                           PQsocket(leader_state->s.conn), 0,
+                                           WAIT_EVENT_ASYNC_WAIT);
+                if (!(rc & WL_SOCKET_READABLE))
+                    available = false;
+            }
+
+            /* fetch the leader's data and enqueue it for the next request */
+            if (available)
+            {
+                fetch_received_data(leader);
+                add_async_waiter(leader);
+            }
+        }
+    }
+
     if (fsstate->next_tuple >= fsstate->num_tuples)
     {
-        /* No point in another fetch if we already detected EOF, though. */
-        if (!fsstate->eof_reached)
-            fetch_more_data(node);
-        /* If we didn't get any tuples, must be end of data. */
-        if (fsstate->next_tuple >= fsstate->num_tuples)
-            return ExecClearTuple(slot);
+        /*
+         * We haven't received a result for the given node this time, return
+         * with no tuple to give way to another node.
+         */
+        if (fsstate->eof_reached)
+            node->ss.ps.asyncstate = AS_AVAILABLE;
+        else
+            node->ss.ps.asyncstate = AS_WAITING;
+
+        return ExecClearTuple(slot);
     }
 
     /*
      * Return the next tuple.
      */
+    node->ss.ps.asyncstate = AS_AVAILABLE;
     ExecStoreHeapTuple(fsstate->tuples[fsstate->next_tuple++],
                        slot,
                        false);
@@ -1534,7 +1787,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     char        sql[64];
     PGresult   *res;
 
@@ -1542,6 +1795,8 @@ postgresReScanForeignScan(ForeignScanState *node)
     if (!fsstate->cursor_exists)
         return;
 
+    vacate_connection((PgFdwState *)fsstate, true);
+
     /*
      * If any internal parameters affecting this node have changed, we'd
      * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1570,9 +1825,9 @@ postgresReScanForeignScan(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_exec_query(fsstate->conn, sql);
+    res = pgfdw_exec_query(fsstate->s.conn, sql);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+        pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
     PQclear(res);
 
     /* Now force a fresh FETCH. */
@@ -1590,7 +1845,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
     /* if fsstate is NULL, we are in EXPLAIN; nothing to do */
     if (fsstate == NULL)
@@ -1598,15 +1853,31 @@ postgresEndForeignScan(ForeignScanState *node)
 
     /* Close the cursor if open, to prevent accumulation of cursors */
     if (fsstate->cursor_exists)
-        close_cursor(fsstate->conn, fsstate->cursor_number);
+        close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
     /* Release remote connection */
-    ReleaseConnection(fsstate->conn);
-    fsstate->conn = NULL;
+    ReleaseConnection(fsstate->s.conn);
+    fsstate->s.conn = NULL;
 
     /* MemoryContexts will be deleted automatically. */
 }
 
+/*
+ * postgresShutdownForeignScan
+ *        Remove asynchrony stuff and cleanup garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+    ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+    if (plan->operation != CMD_SELECT)
+        return;
+
+    /* remove the node from waiting queue */
+    remove_async_node(node);
+}
+
 /*
  * postgresAddForeignUpdateTargets
  *        Add resjunk column(s) needed for update/delete on a foreign table
@@ -2371,7 +2642,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    dmstate->conn = GetConnection(user, false);
+    dmstate->s.conn = GetConnection(user, false);
+    dmstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
 
     /* Update the foreign-join-related fields. */
     if (fsplan->scan.scanrelid == 0)
@@ -2456,7 +2729,11 @@ postgresIterateDirectModify(ForeignScanState *node)
      * If this is the first call after Begin, execute the statement.
      */
     if (dmstate->num_tuples == -1)
+    {
+        /* finish running query to send my command */
+        vacate_connection((PgFdwState *)dmstate, true);
         execute_dml_stmt(node);
+    }
 
     /*
      * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2503,8 +2780,8 @@ postgresEndDirectModify(ForeignScanState *node)
         PQclear(dmstate->result);
 
     /* Release remote connection */
-    ReleaseConnection(dmstate->conn);
-    dmstate->conn = NULL;
+    ReleaseConnection(dmstate->s.conn);
+    dmstate->s.conn = NULL;
 
     /* MemoryContext will be deleted automatically. */
 }
@@ -2702,6 +2979,7 @@ estimate_path_cost_size(PlannerInfo *root,
         List       *local_param_join_conds;
         StringInfoData sql;
         PGconn       *conn;
+        PgFdwConnCommonState *commonstate;
         Selectivity local_sel;
         QualCost    local_cost;
         List       *fdw_scan_tlist = NIL;
@@ -2746,6 +3024,18 @@ estimate_path_cost_size(PlannerInfo *root,
 
         /* Get the remote estimate */
         conn = GetConnection(fpinfo->user, false);
+        commonstate = GetConnectionSpecificStorage(fpinfo->user,
+                                                sizeof(PgFdwConnCommonState));
+        if (commonstate)
+        {
+            PgFdwState tmpstate;
+            tmpstate.conn = conn;
+            tmpstate.commonstate = commonstate;
+
+            /* finish running query to send my command */
+            vacate_connection(&tmpstate, true);
+        }
+
         get_remote_estimate(sql.data, conn, &rows, &width,
                             &startup_cost, &total_cost);
         ReleaseConnection(conn);
@@ -3316,11 +3606,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     ExprContext *econtext = node->ss.ps.ps_ExprContext;
     int            numParams = fsstate->numParams;
     const char **values = fsstate->param_values;
-    PGconn       *conn = fsstate->conn;
+    PGconn       *conn = fsstate->s.conn;
     StringInfoData buf;
     PGresult   *res;
 
@@ -3383,50 +3673,119 @@ create_cursor(ForeignScanState *node)
 }
 
 /*
- * Fetch some more rows from the node's cursor.
+ * Sends the next request of the node. If the given node is different from the
+ * current connection leader, pushes it back to waiter queue and let the given
+ * node be the leader.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+    ForeignScanState *leader = fsstate->s.commonstate->leader;
+    PGconn       *conn = fsstate->s.conn;
+    char        sql[64];
+
+    /* must be non-busy */
+    Assert(!fsstate->s.commonstate->busy);
+    /* must be not-eof'ed */
+    Assert(!fsstate->eof_reached);
+
+    /*
+     * If this is the first call after Begin or ReScan, we need to create the
+     * cursor on the remote side.
+     */
+    if (!fsstate->cursor_exists)
+        create_cursor(node);
+
+    snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+             fsstate->fetch_size, fsstate->cursor_number);
+
+    if (!PQsendQuery(conn, sql))
+        pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+    fsstate->s.commonstate->busy = true;
+
+    /* The node is the current leader, just return. */
+    if (leader == node)
+        return;
+
+    /* Let the node be the leader */
+    if (leader != NULL)
+    {
+        remove_async_node(node);
+        fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+        fsstate->waiter = leader;
+    }
+    else
+    {
+        fsstate->last_waiter = node;
+        fsstate->waiter = NULL;
+    }
+
+    fsstate->s.commonstate->leader = node;
+}
+
+/*
+ * Fetches received data and automatically send requests of the next waiter.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
     PGresult   *volatile res = NULL;
     MemoryContext oldcontext;
+    ForeignScanState *waiter;
+
+    /* I should be the current connection leader */
+    Assert(fsstate->s.commonstate->leader == node);
 
     /*
      * We'll store the tuples in the batch_cxt.  First, flush the previous
-     * batch.
+     * batch if no tuple is remaining
      */
-    fsstate->tuples = NULL;
-    MemoryContextReset(fsstate->batch_cxt);
+    if (fsstate->next_tuple >= fsstate->num_tuples)
+    {
+        fsstate->tuples = NULL;
+        fsstate->num_tuples = 0;
+        MemoryContextReset(fsstate->batch_cxt);
+    }
+    else if (fsstate->next_tuple > 0)
+    {
+        /* There's some remains. Move them to the beginning of the store */
+        int n = 0;
+
+        while(fsstate->next_tuple < fsstate->num_tuples)
+            fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+        fsstate->num_tuples = n;
+    }
+
     oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
     /* PGresult must be released before leaving this function. */
     PG_TRY();
     {
-        PGconn       *conn = fsstate->conn;
-        char        sql[64];
-        int            numrows;
+        PGconn       *conn = fsstate->s.conn;
+        int            addrows;
+        size_t        newsize;
         int            i;
 
-        snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
-                 fsstate->fetch_size, fsstate->cursor_number);
-
-        res = pgfdw_exec_query(conn, sql);
-        /* On error, report the original query, not the FETCH. */
+        res = pgfdw_get_result(conn, fsstate->query);
         if (PQresultStatus(res) != PGRES_TUPLES_OK)
             pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
         /* Convert the data into HeapTuples */
-        numrows = PQntuples(res);
-        fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-        fsstate->num_tuples = numrows;
-        fsstate->next_tuple = 0;
+        addrows = PQntuples(res);
+        newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+        if (fsstate->tuples)
+            fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+        else
+            fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-        for (i = 0; i < numrows; i++)
+        for (i = 0; i < addrows; i++)
         {
             Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-            fsstate->tuples[i] =
+            fsstate->tuples[fsstate->num_tuples + i] =
                 make_tuple_from_result_row(res, i,
                                            fsstate->rel,
                                            fsstate->attinmeta,
@@ -3436,22 +3795,73 @@ fetch_more_data(ForeignScanState *node)
         }
 
         /* Update fetch_ct_2 */
-        if (fsstate->fetch_ct_2 < 2)
+        if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
             fsstate->fetch_ct_2++;
 
+        fsstate->next_tuple = 0;
+        fsstate->num_tuples += addrows;
+
         /* Must be EOF if we didn't get as many tuples as we asked for. */
-        fsstate->eof_reached = (numrows < fsstate->fetch_size);
+        fsstate->eof_reached = (addrows < fsstate->fetch_size);
     }
     PG_FINALLY();
     {
+        fsstate->s.commonstate->busy = false;
+
         if (res)
             PQclear(res);
     }
     PG_END_TRY();
 
+    /* let the first waiter be the next leader of this connection */
+    waiter = move_to_next_waiter(node);
+
+    /* send the next request if any */
+    if (waiter)
+        request_more_data(waiter);
+
     MemoryContextSwitchTo(oldcontext);
 }
 
+/*
+ * Vacate the underlying connection so that this node can send the next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+    PgFdwConnCommonState *commonstate = fdwstate->commonstate;
+    ForeignScanState *leader;
+
+    Assert(commonstate != NULL);
+
+    /* just return if the connection is already available */
+    if (commonstate->leader == NULL || !commonstate->busy)
+        return;
+
+    /*
+     * let the current connection leader read all of the result for the running
+     * query
+     */
+    leader = commonstate->leader;
+    fetch_received_data(leader);
+
+    /* let the first waiter be the next leader of this connection */
+    move_to_next_waiter(leader);
+
+    if (!clear_queue)
+        return;
+
+    /* Clear the waiting list */
+    while (leader)
+    {
+        PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+        fsstate->last_waiter = NULL;
+        leader = fsstate->waiter;
+        fsstate->waiter = NULL;
+    }
+}
+
 /*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
@@ -3565,7 +3975,9 @@ create_foreign_modify(EState *estate,
     user = GetUserMapping(userid, table->serverid);
 
     /* Open connection; report that we'll create a prepared statement. */
-    fmstate->conn = GetConnection(user, true);
+    fmstate->s.conn = GetConnection(user, true);
+    fmstate->s.commonstate = (PgFdwConnCommonState *)
+        GetConnectionSpecificStorage(user, sizeof(PgFdwConnCommonState));
     fmstate->p_name = NULL;        /* prepared statement not made yet */
 
     /* Set up remote query information. */
@@ -3652,6 +4064,9 @@ execute_foreign_modify(EState *estate,
            operation == CMD_UPDATE ||
            operation == CMD_DELETE);
 
+    /* finish running query to send my command */
+    vacate_connection((PgFdwState *)fmstate, true);
+
     /* Set up the prepared statement on the remote server, if we didn't yet */
     if (!fmstate->p_name)
         prepare_foreign_modify(fmstate);
@@ -3679,14 +4094,14 @@ execute_foreign_modify(EState *estate,
     /*
      * Execute the prepared statement.
      */
-    if (!PQsendQueryPrepared(fmstate->conn,
+    if (!PQsendQueryPrepared(fmstate->s.conn,
                              fmstate->p_name,
                              fmstate->p_nums,
                              p_values,
                              NULL,
                              NULL,
                              0))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3694,10 +4109,10 @@ execute_foreign_modify(EState *estate,
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) !=
         (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
     /* Check number of rows affected, and fetch RETURNING tuple if any */
     if (fmstate->has_returning)
@@ -3733,7 +4148,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
     /* Construct name we'll use for the prepared statement. */
     snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-             GetPrepStmtNumber(fmstate->conn));
+             GetPrepStmtNumber(fmstate->s.conn));
     p_name = pstrdup(prep_name);
 
     /*
@@ -3743,12 +4158,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * the prepared statements we use in this module are simple enough that
      * the remote server will make the right choices.
      */
-    if (!PQsendPrepare(fmstate->conn,
+    if (!PQsendPrepare(fmstate->s.conn,
                        p_name,
                        fmstate->query,
                        0,
                        NULL))
-        pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+        pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -3756,9 +4171,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    res = pgfdw_get_result(fmstate->conn, fmstate->query);
+    res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
     if (PQresultStatus(res) != PGRES_COMMAND_OK)
-        pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+        pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
     PQclear(res);
 
     /* This action shows that the prepare has been done. */
@@ -3887,16 +4302,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
          * We don't use a PG_TRY block here, so be careful not to throw error
          * without releasing the PGresult.
          */
-        res = pgfdw_exec_query(fmstate->conn, sql);
+        res = pgfdw_exec_query(fmstate->s.conn, sql);
         if (PQresultStatus(res) != PGRES_COMMAND_OK)
-            pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+            pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
         PQclear(res);
         fmstate->p_name = NULL;
     }
 
     /* Release remote connection */
-    ReleaseConnection(fmstate->conn);
-    fmstate->conn = NULL;
+    ReleaseConnection(fmstate->s.conn);
+    fmstate->s.conn = NULL;
 }
 
 /*
@@ -4055,9 +4470,9 @@ execute_dml_stmt(ForeignScanState *node)
      * the desired result.  This allows us to avoid assuming that the remote
      * server has the same OIDs we do for the parameters' types.
      */
-    if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+    if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
                            NULL, values, NULL, NULL, 0))
-        pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+        pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
     /*
      * Get the result, and check for success.
@@ -4065,10 +4480,10 @@ execute_dml_stmt(ForeignScanState *node)
      * We don't use a PG_TRY block here, so be careful not to throw error
      * without releasing the PGresult.
      */
-    dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+    dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
     if (PQresultStatus(dmstate->result) !=
         (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-        pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+        pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
                            dmstate->query);
 
     /* Get the number of rows affected. */
@@ -5559,6 +5974,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
     /* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+    return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add wait event so that the ForeignScan node is going to wait for.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+                                  void *caller_data, bool reinit)
+{
+    PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+    /* Reinit is not supported for now. */
+    Assert(reinit);
+
+    if (fsstate->s.commonstate->leader == node)
+    {
+        AddWaitEventToSet(wes,
+                          WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+                          NULL, caller_data);
+        return true;
+    }
+
+    return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..96af75a33e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -85,6 +85,7 @@ typedef struct PgFdwRelationInfo
     UserMapping *user;            /* only set in use_remote_estimate mode */
 
     int            fetch_size;        /* fetch size for this remote table */
+    bool        allow_prefetch;    /* true to allow overlapped fetching  */
 
     /*
      * Name of the relation, for use while EXPLAINing ForeignScan.  It is used
@@ -130,6 +131,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 78156d10b4..17d461b1a4 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1799,25 +1799,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1859,12 +1859,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1923,8 +1923,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.18.4


Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Tue, Sep 29, 2020 at 4:45 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> BTW: I noticed that you changed the ExecProcNode() API so that an
> Append calling FDWs can know wether they return tuples immediately or
> not:

> That is, 1) in postgresIterateForeignScan() postgres_fdw sets the new
> PlanState’s flag asyncstate to AS_AVAILABLE/AS_WAITING depending on
> whether it returns a tuple immediately or not, and then 2) the Append
> knows that from the new flag when the callback routine returns.  I’m
> not sure this is a good idea, because it seems likely that the
> ExecProcNode() change would affect many other places in the executor,
> making maintenance and/or future development difficult.  I think the
> FDW callback routines proposed in the original patch by Robert would
> provide a cleaner way to do asynchronous execution of FDWs without
> changing the ExecProcNode() API, IIUC:
>
> +On the other hand, nodes that wish to produce tuples asynchronously
> +generally need to implement three methods:
> +
> +1. When an asynchronous request is made, the node's ExecAsyncRequest callback
> +will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
> +number of file descriptor events for which it wishes to wait and whether it
> +wishes to receive a callback when the process latch is set. Alternatively,
> +it can instead use ExecAsyncRequestDone if a result is available immediately.
> +
> +2. When the event loop wishes to wait or poll for file descriptor events and
> +the process latch, the ExecAsyncConfigureWait callback is invoked to configure
> +the file descriptor wait events for which the node wishes to wait.  This
> +callback isn't needed if the node only cares about the process latch.
> +
> +3. When file descriptors or the process latch become ready, the node's
> +ExecAsyncNotify callback is invoked.
>
> What is the reason for not doing like this in your patch?

I think we should avoid changing the ExecProcNode() API.

Thomas’ patch also provides a clean FDW API that doesn’t change the
ExecProcNode() API, but I think the FDW API provided in Robert’ patch
would be better designed, because I think it would support more
different types of asynchronous interaction between the core and FDWs.
Consider this bit from Thomas’ patch, which produces a tuple when a
file descriptor becomes ready:

+       if (event.events & WL_SOCKET_READABLE)
+       {
+           /* Linear search for the node that told us to wait for this fd. */
+           for (i = 0; i < node->nasyncplans; ++i)
+           {
+               if (event.fd == node->asyncfds[i])
+               {
+                   TupleTableSlot *result;
+
+                   /*
+ -->               * We assume that because the fd is ready, it can produce
+ -->               * a tuple now, which is not perfect.  An improvement
+ -->               * would be if it could say 'not yet, I'm still not
+ -->               * ready', so eg postgres_fdw could PQconsumeInput and
+ -->               * then say 'I need more input'.
+                    */
+                   result = ExecProcNode(node->asyncplans[i]);
+                   if (!TupIsNull(result))
+                   {
+                       /*
+                        * Remember this plan so that append_next_async will
+                        * keep trying this subplan first until it stops
+                        * feeding us buffered tuples.
+                        */
+                       node->lastreadyplan = i;
+                       /* We can stop waiting for this fd. */
+                       node->asyncfds[i] = 0;
+                       return result;
+                   }
+                   else
+                   {
+                       /*
+                        * This subplan has reached EOF.  We'll go back and
+                        * wait for another one.
+                        */
+                       forget_async_subplan(node, i);
+                       break;
+                   }
+               }
+           }
+       }

As commented above, his patch doesn’t allow an FDW to do another data
fetch from the remote side before returning a tuple when the file
descriptor becomes available, but Robert’s patch would, using his FDW
API ForeignAsyncNotify(), which is called when the file descriptor
becomes available, IIUC.

I might be missing something, but I feel inclined to vote for Robert’s
patch (more precisely, Robert’s patch as a base patch with (1) some
planner/executor changes from Horiguchi-san’s patch and (2)
postgres_fdw changes from Thomas’ patch adjusted to match Robert’s FDW
API).

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
At Fri, 2 Oct 2020 09:00:53 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in 
> On Tue, Sep 29, 2020 at 4:45 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > BTW: I noticed that you changed the ExecProcNode() API so that an
> > Append calling FDWs can know wether they return tuples immediately or
> > not:
> 
> > That is, 1) in postgresIterateForeignScan() postgres_fdw sets the new
> > PlanState’s flag asyncstate to AS_AVAILABLE/AS_WAITING depending on
> > whether it returns a tuple immediately or not, and then 2) the Append
> > knows that from the new flag when the callback routine returns.  I’m
> > not sure this is a good idea, because it seems likely that the
> > ExecProcNode() change would affect many other places in the executor,
> > making maintenance and/or future development difficult.  I think the
> > FDW callback routines proposed in the original patch by Robert would
> > provide a cleaner way to do asynchronous execution of FDWs without
> > changing the ExecProcNode() API, IIUC:
> >
> > +On the other hand, nodes that wish to produce tuples asynchronously
> > +generally need to implement three methods:
> > +
> > +1. When an asynchronous request is made, the node's ExecAsyncRequest callback
> > +will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
> > +number of file descriptor events for which it wishes to wait and whether it
> > +wishes to receive a callback when the process latch is set. Alternatively,
> > +it can instead use ExecAsyncRequestDone if a result is available immediately.
> > +
> > +2. When the event loop wishes to wait or poll for file descriptor events and
> > +the process latch, the ExecAsyncConfigureWait callback is invoked to configure
> > +the file descriptor wait events for which the node wishes to wait.  This
> > +callback isn't needed if the node only cares about the process latch.
> > +
> > +3. When file descriptors or the process latch become ready, the node's
> > +ExecAsyncNotify callback is invoked.
> >
> > What is the reason for not doing like this in your patch?
> 
> I think we should avoid changing the ExecProcNode() API.
> Thomas’ patch also provides a clean FDW API that doesn’t change the
> ExecProcNode() API, but I think the FDW API provided in Robert’ patch

Could you explain about what the "change" you are mentioning is?

I have made many changes to reduce performance inpact on existing
paths (before the current PlanState.ExecProcNode was introduced.) So
large part of my changes could be actually reverted.

> would be better designed, because I think it would support more
> different types of asynchronous interaction between the core and FDWs.
> Consider this bit from Thomas’ patch, which produces a tuple when a
> file descriptor becomes ready:
> 
> +       if (event.events & WL_SOCKET_READABLE)
> +       {
> +           /* Linear search for the node that told us to wait for this fd. */
> +           for (i = 0; i < node->nasyncplans; ++i)
> +           {
> +               if (event.fd == node->asyncfds[i])
> +               {
> +                   TupleTableSlot *result;
> +
> +                   /*
> + -->               * We assume that because the fd is ready, it can produce
> + -->               * a tuple now, which is not perfect.  An improvement
> + -->               * would be if it could say 'not yet, I'm still not
> + -->               * ready', so eg postgres_fdw could PQconsumeInput and
> + -->               * then say 'I need more input'.
> +                    */
> +                   result = ExecProcNode(node->asyncplans[i]);
..
> As commented above, his patch doesn’t allow an FDW to do another data
> fetch from the remote side before returning a tuple when the file
> descriptor becomes available, but Robert’s patch would, using his FDW
> API ForeignAsyncNotify(), which is called when the file descriptor
> becomes available, IIUC.
> 
> I might be missing something, but I feel inclined to vote for Robert’s
> patch (more precisely, Robert’s patch as a base patch with (1) some
> planner/executor changes from Horiguchi-san’s patch and (2)
> postgres_fdw changes from Thomas’ patch adjusted to match Robert’s FDW
> API).

I'm not sure what you have in mind from the description above.  Could
you please ellaborate?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, Oct 2, 2020 at 3:39 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Fri, 2 Oct 2020 09:00:53 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
> > I think we should avoid changing the ExecProcNode() API.

> Could you explain about what the "change" you are mentioning is?

It’s the contract of the ExecProcNode() API: if the result is NULL or
an empty slot, there is nothing more to do.  You changed it to
something like this: “even if the result is NULL or an empty slot,
there might be something more to do if AS_WAITING, so please wait in
that case”.  That seems pretty invasive to me.

> > I might be missing something, but I feel inclined to vote for Robert’s
> > patch (more precisely, Robert’s patch as a base patch with (1) some
> > planner/executor changes from Horiguchi-san’s patch and (2)
> > postgres_fdw changes from Thomas’ patch adjusted to match Robert’s FDW
> > API).
>
> I'm not sure what you have in mind from the description above.  Could
> you please ellaborate?

Sorry, my explanation was not enough.

You made lots of changes to the original patch by Robert, but I don’t
think those changes are all good; 1) as for the core part, you changed
his patch so that FDWs can interact with the core at execution time,
only through the ForeignAsyncConfigureWait() API, but that resulted in
an invasive change to the ExecProcNode() API as mentioned above, and
2) as for the postgres_fdw part, you changed it so that postgres_fdw
can handle concurrent data fetches from multiple foreign scan nodes
using the same connection, but that would cause a performance issue
that I mentioned in [1].

So I think it would be better to use his patch rather as proposed
except for the postgres_fdw part and Thomas’ patch as a base patch for
that part.  As for your patch, I think we could use some part of it as
improvements.  One thing is the planner/executor changes that lead to
the improved efficiency discussed in [2][3].  Another would be to have
a separate ExecAppend() function for this feature like your patch to
avoid a performance penalty in the case of a plain old Append that
involves no FDWs with asynchronism optimization, if necessary.  I also
think we could probably use the  WaitEventSet-related changes in your
patch (i.e., the 0001 patch).

Does that answer your question?

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx%2B3dyivCT%3DA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAPmGK16%2By8mEX9AT1LXVLksbTyDnYWZXm0uDxZ8bza153Wey9A%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAPmGK14AjvCd9QuoRQ-ATyExA_SiVmGFGstuqAKSzZ7JDJTBVg%40mail.gmail.com



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
At Sun, 4 Oct 2020 18:36:05 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in 
> On Fri, Oct 2, 2020 at 3:39 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > At Fri, 2 Oct 2020 09:00:53 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
> > > I think we should avoid changing the ExecProcNode() API.
> 
> > Could you explain about what the "change" you are mentioning is?

Thank you for the explanation.

> It’s the contract of the ExecProcNode() API: if the result is NULL or
> an empty slot, there is nothing more to do.  You changed it to
> something like this: “even if the result is NULL or an empty slot,
> there might be something more to do if AS_WAITING, so please wait in
> that case”.  That seems pretty invasive to me.

Yeah, it's "invasive' as I intended. I thought that the async-aware
and async-capable nodes should interact using a channel defined as a
part of ExecProcNode API.  It was aiming an increased affinity to
push-up executor framework.

Since the current direction is committing this feature as a
intermediate or tentative implement, it sounds reasonable to avoid
such a change.

> > > I might be missing something, but I feel inclined to vote for Robert’s
> > > patch (more precisely, Robert’s patch as a base patch with (1) some
> > > planner/executor changes from Horiguchi-san’s patch and (2)
> > > postgres_fdw changes from Thomas’ patch adjusted to match Robert’s FDW
> > > API).
> >
> > I'm not sure what you have in mind from the description above.  Could
> > you please ellaborate?
> 
> Sorry, my explanation was not enough.
> 
> You made lots of changes to the original patch by Robert, but I don’t
> think those changes are all good; 1) as for the core part, you changed
> his patch so that FDWs can interact with the core at execution time,
> only through the ForeignAsyncConfigureWait() API, but that resulted in
> an invasive change to the ExecProcNode() API as mentioned above, and
> 2) as for the postgres_fdw part, you changed it so that postgres_fdw
> can handle concurrent data fetches from multiple foreign scan nodes
> using the same connection, but that would cause a performance issue
> that I mentioned in [1].

(Putting aside the bug itself..)

Yeah, I noticed such a possibility of fetch cascading, however, I
think that that situation that the feature is intended for is more
common than the problem case.

Being said, I agree that it is a candidate to rip out when we are
thinking to reduce the footprint of this patch.

> So I think it would be better to use his patch rather as proposed
> except for the postgres_fdw part and Thomas’ patch as a base patch for
> that part.  As for your patch, I think we could use some part of it as
> improvements.  One thing is the planner/executor changes that lead to
> the improved efficiency discussed in [2][3].  Another would be to have
> a separate ExecAppend() function for this feature like your patch to
> avoid a performance penalty in the case of a plain old Append that
> involves no FDWs with asynchronism optimization, if necessary.  I also
> think we could probably use the  WaitEventSet-related changes in your
> patch (i.e., the 0001 patch).
> 
> Does that answer your question?

Yes, thanks.  Comments about the direction from me is as above. Are
you going to continue working on this patch?


> [1] https://www.postgresql.org/message-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx%2B3dyivCT%3DA%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/CAPmGK16%2By8mEX9AT1LXVLksbTyDnYWZXm0uDxZ8bza153Wey9A%40mail.gmail.com
> [3] https://www.postgresql.org/message-id/CAPmGK14AjvCd9QuoRQ-ATyExA_SiVmGFGstuqAKSzZ7JDJTBVg%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Mon, Oct 5, 2020 at 1:30 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Sun, 4 Oct 2020 18:36:05 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
> > It’s the contract of the ExecProcNode() API: if the result is NULL or
> > an empty slot, there is nothing more to do.  You changed it to
> > something like this: “even if the result is NULL or an empty slot,
> > there might be something more to do if AS_WAITING, so please wait in
> > that case”.  That seems pretty invasive to me.
>
> Yeah, it's "invasive' as I intended. I thought that the async-aware
> and async-capable nodes should interact using a channel defined as a
> part of ExecProcNode API.  It was aiming an increased affinity to
> push-up executor framework.
>
> Since the current direction is committing this feature as a
> intermediate or tentative implement, it sounds reasonable to avoid
> such a change.

OK  (Actually, I'm wondering if we could probably extend this to the
case where an Append is indirectly on top of a foreign scan node
without changing the ExecProcNode() API.)

> > You made lots of changes to the original patch by Robert, but I don’t
> > think those changes are all good; 1) as for the core part, you changed
> > his patch so that FDWs can interact with the core at execution time,
> > only through the ForeignAsyncConfigureWait() API, but that resulted in
> > an invasive change to the ExecProcNode() API as mentioned above, and
> > 2) as for the postgres_fdw part, you changed it so that postgres_fdw
> > can handle concurrent data fetches from multiple foreign scan nodes
> > using the same connection, but that would cause a performance issue
> > that I mentioned in [1].

> Yeah, I noticed such a possibility of fetch cascading, however, I
> think that that situation that the feature is intended for is more
> common than the problem case.

I think a cleaner solution to that would be to support multiple
connections to the remote server...

> > So I think it would be better to use his patch rather as proposed
> > except for the postgres_fdw part and Thomas’ patch as a base patch for
> > that part.  As for your patch, I think we could use some part of it as
> > improvements.  One thing is the planner/executor changes that lead to
> > the improved efficiency discussed in [2][3].  Another would be to have
> > a separate ExecAppend() function for this feature like your patch to
> > avoid a performance penalty in the case of a plain old Append that
> > involves no FDWs with asynchronism optimization, if necessary.  I also
> > think we could probably use the  WaitEventSet-related changes in your
> > patch (i.e., the 0001 patch).
> >
> > Does that answer your question?
>
> Yes, thanks.  Comments about the direction from me is as above. Are
> you going to continue working on this patch?

Yes, if there are no objections from you or Thomas or Robert or anyone
else, I'll update Robert's patch as such.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
"Andrey V. Lepikhov"
Date:
On 10/5/20 11:35 AM, Etsuro Fujita wrote:
Hi,
I found a small problem. If we have a mix of async and sync subplans 
when we catch an assertion on a busy connection. Just for example:

PLAN
====
Nested Loop  (cost=100.00..174316.95 rows=975 width=8) (actual 
time=5.191..9.262 rows=9 loops=1)
    Join Filter: (frgn.a = l.a)
    Rows Removed by Join Filter: 8991
    ->  Append  (cost=0.00..257.20 rows=11890 width=4) (actual 
time=0.419..2.773 rows=1000 loops=1)
          Async subplans: 4
          ->  Async Foreign Scan on f_1 l_2  (cost=100.00..197.75 
rows=2925 width=4) (actual time=0.381..0.585 rows=211 loops=1)
          ->  Async Foreign Scan on f_2 l_3  (cost=100.00..197.75 
rows=2925 width=4) (actual time=0.005..0.206 rows=195 loops=1)
          ->  Async Foreign Scan on f_3 l_4  (cost=100.00..197.75 
rows=2925 width=4) (actual time=0.003..0.282 rows=187 loops=1)
          ->  Async Foreign Scan on f_4 l_5  (cost=100.00..197.75 
rows=2925 width=4) (actual time=0.003..0.316 rows=217 loops=1)
          ->  Seq Scan on l_0 l_1  (cost=0.00..2.90 rows=190 width=4) 
(actual time=0.017..0.057 rows=190 loops=1)
    ->  Materialize  (cost=100.00..170.94 rows=975 width=4) (actual 
time=0.001..0.002 rows=9 loops=1000)
          ->  Foreign Scan on frgn  (cost=100.00..166.06 rows=975 
width=4) (actual time=0.766..0.768 rows=9 loops=1)

Reproduction script 'test1.sql' see in attachment. Here I force the 
problem reproduction with setting enable_hashjoin and enable_mergejoin 
to off.

'asyncmix.patch' contains my solution to this problem.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Andrey Lepikhov
Date:
Hi,
I want to suggest one more improvement. Currently the
is_async_capable_path() routine allow only ForeignPath nodes as async 
capable path. But in some cases we can allow SubqueryScanPath as async 
capable too.

For example:
SELECT * FROM ((SELECT * FROM foreign_1)
UNION ALL
(SELECT a FROM foreign_2)) AS b;

is async capable, but:

SELECT * FROM ((SELECT * FROM foreign_1 LIMIT 10)
UNION ALL
(SELECT a FROM foreign_2 LIMIT 10)) AS b;

doesn't async capable.

The patch in attachment tries to improve this situation.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Thu, Oct 8, 2020 at 6:39 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> I found a small problem. If we have a mix of async and sync subplans
> when we catch an assertion on a busy connection. Just for example:
>
> PLAN
> ====
> Nested Loop  (cost=100.00..174316.95 rows=975 width=8) (actual
> time=5.191..9.262 rows=9 loops=1)
>     Join Filter: (frgn.a = l.a)
>     Rows Removed by Join Filter: 8991
>     ->  Append  (cost=0.00..257.20 rows=11890 width=4) (actual
> time=0.419..2.773 rows=1000 loops=1)
>           Async subplans: 4
>           ->  Async Foreign Scan on f_1 l_2  (cost=100.00..197.75
> rows=2925 width=4) (actual time=0.381..0.585 rows=211 loops=1)
>           ->  Async Foreign Scan on f_2 l_3  (cost=100.00..197.75
> rows=2925 width=4) (actual time=0.005..0.206 rows=195 loops=1)
>           ->  Async Foreign Scan on f_3 l_4  (cost=100.00..197.75
> rows=2925 width=4) (actual time=0.003..0.282 rows=187 loops=1)
>           ->  Async Foreign Scan on f_4 l_5  (cost=100.00..197.75
> rows=2925 width=4) (actual time=0.003..0.316 rows=217 loops=1)
>           ->  Seq Scan on l_0 l_1  (cost=0.00..2.90 rows=190 width=4)
> (actual time=0.017..0.057 rows=190 loops=1)
>     ->  Materialize  (cost=100.00..170.94 rows=975 width=4) (actual
> time=0.001..0.002 rows=9 loops=1000)
>           ->  Foreign Scan on frgn  (cost=100.00..166.06 rows=975
> width=4) (actual time=0.766..0.768 rows=9 loops=1)

Actually I also found a similar issue before [1].  But in the first
place I'm not sure the way of handling concurrent data fetches by
multiple ForeignScan nodes using the same connection in postgres_fdw
implemented in Horiguchi-san's patch would be really acceptable,
because that would impact performance *negatively* in some cases as
mentioned in [1].  So I feel inclined to just disable this feature in
problematic cases including the above one in the first cut.  Even with
such a limitation, I think it would be useful, because it would cover
typical use cases such as partitionwise joins and partitionwise
aggregates.

Thanks for the report!

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx%2B3dyivCT%3DA%40mail.gmail.com



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Thu, Oct 8, 2020 at 8:40 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> I want to suggest one more improvement. Currently the
> is_async_capable_path() routine allow only ForeignPath nodes as async
> capable path. But in some cases we can allow SubqueryScanPath as async
> capable too.

> The patch in attachment tries to improve this situation.

Seems like a good idea.  Will look at the patch in detail.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Mon, Oct 5, 2020 at 3:35 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Yes, if there are no objections from you or Thomas or Robert or anyone
> else, I'll update Robert's patch as such.

Here is a new version of the patch (as promised in the developer
unconference in PostgresConf.CN & PGConf.Asia 2020):

* In Robert's patch [1] (and Horiguchi-san's, which was created based
on Robert's), ExecAppend() was modified to retrieve tuples from
async-aware children *before* the tuples will be needed, but I don't
think that's really a good idea, because the query might complete
before returning the tuples.  So I modified that function so that a
tuple is retrieved from an async-aware child *when* it is needed, like
Thomas' patch.  I used FDW callback functions proposed by Robert, but
introduced another FDW callback function ForeignAsyncBegin() for each
async-aware child to start an asynchronous data fetch at the first
call to ExecAppend() after ExecInitAppend() or ExecReScanAppend().

* For EvalPlanQual, I modified the patch so that async-aware children
are treated as if they were synchronous when executing EvalPlanQual.

* In Robert's patch, all async-aware children below Append nodes in
the query waiting for events to occur were managed by a single EState,
but I modified the patch so that such children are managed by each
Append node, like Horiguchi-san's patch and Thomas'.

* In Robert's patch, the FDW callback function
ForeignAsyncConfigureWait() allowed multiple events to be configured,
but I limited that function to only allow a single event to be
configured, just for simplicity.

* I haven't yet added some planner/resowner changes from Horiguchi-san's patch.

* I haven't yet done anything about the issue on postgres_fdw's
handling of concurrent data fetches by multiple ForeignScan nodes
(below *different* Append nodes in the query) using the same
connection discussed in [2].  I modified the patch to just disable
applying this feature to problematic test cases in the postgres_fdw
regression tests, by a new GUC enable_async_append.

Comments welcome!  The attached is still WIP and maybe I'm missing
something, though.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CA%2BTgmoaXQEt4tZ03FtQhnzeDEMzBck%2BLrni0UWHVVgOTnA6C1w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx%2B3dyivCT%3DA%40mail.gmail.com

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
Thanks you for the new version.

At Tue, 17 Nov 2020 18:56:02 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in 
> On Mon, Oct 5, 2020 at 3:35 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > Yes, if there are no objections from you or Thomas or Robert or anyone
> > else, I'll update Robert's patch as such.
> 
> Here is a new version of the patch (as promised in the developer
> unconference in PostgresConf.CN & PGConf.Asia 2020):
> 
> * In Robert's patch [1] (and Horiguchi-san's, which was created based
> on Robert's), ExecAppend() was modified to retrieve tuples from
> async-aware children *before* the tuples will be needed, but I don't

The "retrieve" means the move of a tuple from fdw to executor
(ExecAppend or ExecAsync) layer?

> think that's really a good idea, because the query might complete
> before returning the tuples.  So I modified that function so that a

I'm not sure how it matters. Anyway the fdw holds up to tens of tuples
before the executor actually make requests for them.  The reason for
the early fetching is letting fdw send the next request as early as
possible. (However, I didn't measure the effect of the
nodeAppend-level prefetching.)

> tuple is retrieved from an async-aware child *when* it is needed, like
> Thomas' patch.  I used FDW callback functions proposed by Robert, but
> introduced another FDW callback function ForeignAsyncBegin() for each
> async-aware child to start an asynchronous data fetch at the first
> call to ExecAppend() after ExecInitAppend() or ExecReScanAppend().

Even though the terminology is not officially determined, in the past
discussions "async-aware" meant "can handle async-capable subnodes"
and "async-capable" is used as "can run asynchronously".  Likewise you
seem to have changed the meaning of as_needrequest from "subnodes that
needs to request for the next tuple" to "subnodes that already have
got query-send request and waiting for the result to come".  I would
argue to use the words and variables (names) in such meanings.  (Yeah,
parallel_aware is being used in that meaning, I'm not sure what is the
better wordings for the aware-capable relationship in that case.)

> * For EvalPlanQual, I modified the patch so that async-aware children
> are treated as if they were synchronous when executing EvalPlanQual.

Doesn't async execution accelerate the epq-fetching?  Or does
async-execution goes into trouble in the EPQ path?

> * In Robert's patch, all async-aware children below Append nodes in
> the query waiting for events to occur were managed by a single EState,
> but I modified the patch so that such children are managed by each
> Append node, like Horiguchi-san's patch and Thomas'.

Managing in Estate give advantage for push-up style executor but
managing in node_state is simpler.

> * In Robert's patch, the FDW callback function
> ForeignAsyncConfigureWait() allowed multiple events to be configured,
> but I limited that function to only allow a single event to be
> configured, just for simplicity.

No problem for me.

> * I haven't yet added some planner/resowner changes from Horiguchi-san's patch.
> 
> * I haven't yet done anything about the issue on postgres_fdw's
> handling of concurrent data fetches by multiple ForeignScan nodes
> (below *different* Append nodes in the query) using the same
> connection discussed in [2].  I modified the patch to just disable
> applying this feature to problematic test cases in the postgres_fdw
> regression tests, by a new GUC enable_async_append.
> 
> Comments welcome!  The attached is still WIP and maybe I'm missing
> something, though.
> 
> Best regards,
> Etsuro Fujita
> 
> [1] https://www.postgresql.org/message-id/CA%2BTgmoaXQEt4tZ03FtQhnzeDEMzBck%2BLrni0UWHVVgOTnA6C1w%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx%2B3dyivCT%3DA%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
Hello.

I looked through the nodeAppend.c and postgres_fdw.c part and those
are I think the core of this patch.

-         * figure out which subplan we are currently processing
+         * try to get a tuple from async subplans
+         */
+        if (!bms_is_empty(node->as_needrequest) ||
+            (node->as_syncdone && !bms_is_empty(node->as_asyncpending)))
+        {
+            if (ExecAppendAsyncGetNext(node, &result))
+                return result;

The function ExecAppendAsyncGetNext() is a function called only here,
and contains only 31 lines.  It doesn't seem to me that the separation
makes the code more readable.

-        /* choose new subplan; if none, we're done */
-        if (!node->choose_next_subplan(node))
+        /* wait or poll async events */
+        if (!bms_is_empty(node->as_asyncpending))
+        {
+            Assert(!node->as_syncdone);
+            Assert(bms_is_empty(node->as_needrequest));
+            ExecAppendAsyncEventWait(node);

You moved the function to wait for events from execAsync to
nodeAppend. The former is a generic module that can be used from any
kind of executor nodes, but the latter is specialized for nodeAppend.
In other words, the abstraction level is lowered here.  What is the
reason for the change?


+        /* Perform the actual callback. */
+        ExecAsyncRequest(areq);
+        if (ExecAppendAsyncResponse(areq))
+        {
+            Assert(!TupIsNull(areq->result));
+            *result = areq->result;

Putting aside the name of the functions, the first two function are
used only this way at only two places. ExecAsyncRequest(areq) tells
fdw to store the first tuple among the already received ones to areq,
and ExecAppendAsyncResponse(areq) is checking the result is actually
set. Finally the result is retrieved directory from areq->result.
What is the reason that the two functions are separately exists?


+            /* Perform the actual callback. */
+            ExecAsyncNotify(areq);

Mmm. The usage of the function (or its name) looks completely reverse
to me.  I think FDW should NOTIFY to exec nodes that the new tuple
gets available but the reverse is nonsense. What the function is
actually doing is to REQUEST fdw to fetch tuples that are expected to
have arrived, which is different from what the name suggests.


postgres_fdw.c

> postgresIterateForeignScan(ForeignScanState *node)
> {
>     PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
>     TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
> 
>     /*
>      * If this is the first call after Begin or ReScan, we need to create the
>      * cursor on the remote side.
>      */
>     if (!fsstate->cursor_exists)
>         create_cursor(node);

With the patch, cursors are also created in another place so at least
the comment is wrong.  That being said, I think we should unify the
code except the differences between async and sync.  For example, if
the fetch_more_data_begin() needs to be called only for async
fetching, the cursor should be created before calling the function, in
the code path common with sync fetching.


+
+        /* If this was the second part of an async request, we must fetch until NULL. */
+        if (fsstate->async_aware)
+        {
+            /* call once and raise error if not NULL as expected? */
+            while (PQgetResult(conn) != NULL)
+                ;
+            fsstate->conn_state->async_query_sent = false;
+        }

PQgetResult() receives the result of a query at once. This code means
several queries (FETCHes) are queued in, and we discard the result
except the last one.  Actually the res is just PQclear'd just after so
this just discards *all* result of maybe more than one FETCHes.  I
think something's wrong if we need this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
At Fri, 20 Nov 2020 20:16:42 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
me> +        /* If this was the second part of an async request, we must fetch until NULL. */
me> +        if (fsstate->async_aware)
me> +        {
me> +            /* call once and raise error if not NULL as expected? */
me> +            while (PQgetResult(conn) != NULL)
me> +                ;
me> +            fsstate->conn_state->async_query_sent = false;
me> +        }
me> 
me> PQgetResult() receives the result of a query at once. This code means
me> several queries (FETCHes) are queued in, and we discard the result
me> except the last one.  Actually the res is just PQclear'd just after so
me> this just discards *all* result of maybe more than one FETCHes.  I
me> think something's wrong if we need this.

I was wrong, it is worse. That leaks the returned PGresult.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
"movead.li@highgo.ca"
Date:

I test the patch and occur several issues as blow:

Issue one:
Get a Assert error at 'Assert(bms_is_member(i, node->as_needrequest));in
ExecAppendAsyncRequest() function when I use more than two foreign table
on different foreign server.

I research the code and do such change then the Assert problom disappear.

@@ -1004,6 +1004,7 @@ ExecAppendAsyncResponse(AsyncRequest *areq) bms_del_member(node->as_needrequest, areq->request_index); node->as_asyncpending = bms_add_member(node->as_asyncpending, areq->request_index); + node->as_lastasyncplan = INVALID_SUBPLAN_INDEX; return false; }

Issue two:
Then I test and find if I have sync subplan and async sunbplan, it will run over
the sync subplan then the async turn, I do not know if it is intent.

Issue three:
After code change mentioned in the Issue one, I can not get performance improvement.
I query on partitioned table and all sub-partition the time spent on partitioned table
always same as the sum of all sub-partition.

Sorry if I have something wrong when test the patch.


Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca

Re: Asynchronous Append on postgres_fdw nodes.

From
Craig Ringer
Date:
"On Thu, Nov 26, 2020 at 9:28 AM movead.li@highgo.ca
<movead.li@highgo.ca> wrote:
>
>
> I test the patch and occur several issues as blow:
>
> Issue one:
> Get a Assert error at 'Assert(bms_is_member(i, node->as_needrequest));' in
> ExecAppendAsyncRequest() function when I use more than two foreign table
> on different foreign server.
>
> I research the code and do such change then the Assert problem disappear.
>
> @@ -1004,6 +1004,7 @@ ExecAppendAsyncResponse(AsyncRequest *areq) bms_del_member(node->as_needrequest,
areq->request_index);node->as_asyncpending = bms_add_member(node->as_asyncpending, areq->request_index); +
node->as_lastasyncplan= INVALID_SUBPLAN_INDEX; return false; } 
>
> Issue two:
> Then I test and find if I have sync subplan and async sunbplan, it will run over
> the sync subplan then the async turn, I do not know if it is intent.

I only just noticed this patch. It's very interesting to me given the
ongoing work happening on postgres_fdw batching and the way libpq
pipelining is looking like it's getting there. I'll study up on the
executor and see if I can understand this well enough to hack together
a PoC to make it use libpq batching.

Have you taken a look at how this patch may overlap with those?

See -hackers threads:

* "POC: postgres_fdw insert batching" [1]
* "PATCH: Batch/pipelining support for libpq" [2]

[1]
https://www.postgresql.org/message-id/OSBPR01MB2982039EA967F0304CC6A3ECFE0B0@OSBPR01MB2982.jpnprd01.prod.outlook.com
[2] https://www.postgresql.org/message-id/20201026190936.GA18705@alvherre.pgsql



Re: Asynchronous Append on postgres_fdw nodes.

From
"Andrey V. Lepikhov"
Date:
On 11/17/20 2:56 PM, Etsuro Fujita wrote:
> On Mon, Oct 5, 2020 at 3:35 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Comments welcome!  The attached is still WIP and maybe I'm missing
> something, though.
I reviewed your patch and used it in my TPC-H benchmarks. It is still 
WIP. Will you improve this patch?

I also want to say that, in my opinion, Horiguchi-san's version seems 
preferable: it is more structured, simple to understand, executor-native 
and allows to reduce FDW interface changes. This code really only needs 
one procedure - IsForeignPathAsyncCapable.

-- 
regards,
Andrey Lepikhov
Postgres Professional



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Tue, 17 Nov 2020 18:56:02 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
> > * In Robert's patch [1] (and Horiguchi-san's, which was created based
> > on Robert's), ExecAppend() was modified to retrieve tuples from
> > async-aware children *before* the tuples will be needed, but I don't
>
> The "retrieve" means the move of a tuple from fdw to executor
> (ExecAppend or ExecAsync) layer?

Yes, that's what I mean.

> > think that's really a good idea, because the query might complete
> > before returning the tuples.  So I modified that function so that a
>
> I'm not sure how it matters. Anyway the fdw holds up to tens of tuples
> before the executor actually make requests for them.  The reason for
> the early fetching is letting fdw send the next request as early as
> possible. (However, I didn't measure the effect of the
> nodeAppend-level prefetching.)

I agree that that would lead to an improved efficiency in some cases,
but I still think that that would be useless in some other cases like
SELECT * FROM sharded_table LIMIT 1.  Also, I think the situation
would get worse if we support Append on top of joins or aggregates
over ForeignScans, which would be more expensive to perform than these
ForeignScans.

If we do prefetching, I think it would be better that it’s the
responsibility of the FDW to do prefetching, and I think that that
could be done by letting the FDW to start another data fetch,
independently of the core, in the ForeignAsyncNotify callback routine,
which I revived from Robert's original patch.  I think that that would
be more efficient, because the FDW would no longer need to wait until
all buffered tuples are returned to the core.  In the WIP patch, I
only allowed the callback routine to put the corresponding ForeignScan
node into a state where it’s either ready for a new request or needing
a callback for another data fetch, but I think we could probably relax
the restriction so that the ForeignScan node can be put into another
state where it’s ready for a new request while needing a callback for
the prefetch.

> > tuple is retrieved from an async-aware child *when* it is needed, like
> > Thomas' patch.  I used FDW callback functions proposed by Robert, but
> > introduced another FDW callback function ForeignAsyncBegin() for each
> > async-aware child to start an asynchronous data fetch at the first
> > call to ExecAppend() after ExecInitAppend() or ExecReScanAppend().
>
> Even though the terminology is not officially determined, in the past
> discussions "async-aware" meant "can handle async-capable subnodes"
> and "async-capable" is used as "can run asynchronously".

Thanks for the explanation!

> Likewise you
> seem to have changed the meaning of as_needrequest from "subnodes that
> needs to request for the next tuple" to "subnodes that already have
> got query-send request and waiting for the result to come".

No.  I think I might slightly change the original definition of
as_needrequest, though.

> I would
> argue to use the words and variables (names) in such meanings.

I think the word "aware" has a broader meaning, so the naming as
proposed would be OK IMO.  But actually, I don't have any strong
opinion about that, so I'll change it as explained.

> > * For EvalPlanQual, I modified the patch so that async-aware children
> > are treated as if they were synchronous when executing EvalPlanQual.
>
> Doesn't async execution accelerate the epq-fetching?  Or does
> async-execution goes into trouble in the EPQ path?

The reason why I disabled async execution when executing EPQ is to
avoid sending asynchronous queries to the remote sides, which would be
useless, because scan tuples for an EPQ recheck are obtained in a
dedicated way.

> > * In Robert's patch, all async-aware children below Append nodes in
> > the query waiting for events to occur were managed by a single EState,
> > but I modified the patch so that such children are managed by each
> > Append node, like Horiguchi-san's patch and Thomas'.
>
> Managing in Estate give advantage for push-up style executor but
> managing in node_state is simpler.

What do you mean by "push-up style executor"?

Thanks for the review!  Sorry for the delay.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, Nov 20, 2020 at 8:16 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> I looked through the nodeAppend.c and postgres_fdw.c part and those
> are I think the core of this patch.

Thanks again for the review!

> -                * figure out which subplan we are currently processing
> +                * try to get a tuple from async subplans
> +                */
> +               if (!bms_is_empty(node->as_needrequest) ||
> +                       (node->as_syncdone && !bms_is_empty(node->as_asyncpending)))
> +               {
> +                       if (ExecAppendAsyncGetNext(node, &result))
> +                               return result;
>
> The function ExecAppendAsyncGetNext() is a function called only here,
> and contains only 31 lines.  It doesn't seem to me that the separation
> makes the code more readable.

Considering the original ExecAppend() is about 50 lines long, 31 lines
of code would not be small.  So I'd vote for separating it into
another function as proposed.

> -               /* choose new subplan; if none, we're done */
> -               if (!node->choose_next_subplan(node))
> +               /* wait or poll async events */
> +               if (!bms_is_empty(node->as_asyncpending))
> +               {
> +                       Assert(!node->as_syncdone);
> +                       Assert(bms_is_empty(node->as_needrequest));
> +                       ExecAppendAsyncEventWait(node);
>
> You moved the function to wait for events from execAsync to
> nodeAppend. The former is a generic module that can be used from any
> kind of executor nodes, but the latter is specialized for nodeAppend.
> In other words, the abstraction level is lowered here.  What is the
> reason for the change?

The reason is just because that function is only called from
ExecAppend().  I put some functions only called from nodeAppend.c in
execAsync.c, though.

> +               /* Perform the actual callback. */
> +               ExecAsyncRequest(areq);
> +               if (ExecAppendAsyncResponse(areq))
> +               {
> +                       Assert(!TupIsNull(areq->result));
> +                       *result = areq->result;
>
> Putting aside the name of the functions, the first two function are
> used only this way at only two places. ExecAsyncRequest(areq) tells
> fdw to store the first tuple among the already received ones to areq,
> and ExecAppendAsyncResponse(areq) is checking the result is actually
> set. Finally the result is retrieved directory from areq->result.
> What is the reason that the two functions are separately exists?

I think that when an async-aware node gets a tuple from an
async-capable node, they should use ExecAsyncRequest() /
ExecAyncHogeResponse() rather than ExecProcNode() [1].  I modified the
patch so that ExecAppendAsyncResponse() is called from Append, but to
support bubbling up the plan tree discussed in [2], I think it should
be called from ForeignScans (the sides of async-capable nodes).  Am I
right?  Anyway, I’ll rename ExecAppendAyncResponse() to the one
proposed in Robert’s original patch.

> +                       /* Perform the actual callback. */
> +                       ExecAsyncNotify(areq);
>
> Mmm. The usage of the function (or its name) looks completely reverse
> to me.  I think FDW should NOTIFY to exec nodes that the new tuple
> gets available but the reverse is nonsense. What the function is
> actually doing is to REQUEST fdw to fetch tuples that are expected to
> have arrived, which is different from what the name suggests.

As mentioned in a previous email, this is an FDW callback routine
revived from Robert’s patch.  I think the naming is reasonable,
because the callback routine notifies the FDW of readiness of a file
descriptor.  And actually, the callback routine tells the core whether
the corresponding ForeignScan node is ready for a new request or not,
by setting the callback_pending flag accordingly.

> postgres_fdw.c
>
> > postgresIterateForeignScan(ForeignScanState *node)
> > {
> >       PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
> >       TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
> >
> >       /*
> >        * If this is the first call after Begin or ReScan, we need to create the
> >        * cursor on the remote side.
> >        */
> >       if (!fsstate->cursor_exists)
> >               create_cursor(node);
>
> With the patch, cursors are also created in another place so at least
> the comment is wrong.

Good catch!  Will fix.

> That being said, I think we should unify the
> code except the differences between async and sync.  For example, if
> the fetch_more_data_begin() needs to be called only for async
> fetching, the cursor should be created before calling the function, in
> the code path common with sync fetching.

I think that that would make the code easier to understand, but I’m
not 100% sure we really need to do so.

Best regards,
Etsuro Fujita

[1]
https://www.postgresql.org/message-id/CA%2BTgmoYrbgTBnLwnr1v%3Dpk%2BC%3DznWg7AgV9%3DM9ehrq6TDexPQNw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CA%2BTgmoZSWnhy%3DSB3ggQcB6EqKxzbNeNn%3DEfwARnCS5tyhhBNcw%40mail.gmail.com



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
At Sat, 12 Dec 2020 18:25:57 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in 
> On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > At Tue, 17 Nov 2020 18:56:02 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
> > > * In Robert's patch [1] (and Horiguchi-san's, which was created based
> > > on Robert's), ExecAppend() was modified to retrieve tuples from
> > > async-aware children *before* the tuples will be needed, but I don't
> >
> > The "retrieve" means the move of a tuple from fdw to executor
> > (ExecAppend or ExecAsync) layer?
> 
> Yes, that's what I mean.
> 
> > > think that's really a good idea, because the query might complete
> > > before returning the tuples.  So I modified that function so that a
> >
> > I'm not sure how it matters. Anyway the fdw holds up to tens of tuples
> > before the executor actually make requests for them.  The reason for
> > the early fetching is letting fdw send the next request as early as
> > possible. (However, I didn't measure the effect of the
> > nodeAppend-level prefetching.)
> 
> I agree that that would lead to an improved efficiency in some cases,
> but I still think that that would be useless in some other cases like
> SELECT * FROM sharded_table LIMIT 1.  Also, I think the situation
> would get worse if we support Append on top of joins or aggregates
> over ForeignScans, which would be more expensive to perform than these
> ForeignScans.

I'm not sure which gain we weigh on, but if doing "LIMIT 1" on Append
for many times is more common than fetching all or "LIMIT <many
multiples of fetch_size>", that discussion would be convincing... Is
it really the case?

Since core knows of async execution, I think if we disable async
exection, it should be decided by planner, which knows how many tuples
are expected to be returned.  On the other hand the most apparent
criteria for whether to enable async or not would be fetch_size, which
is fdw's secret. Thus we could rename ForeignPathAsyncCapable() to
something like ForeignPathRunAsync(), true from which means "the FDW
is telling that it can run async and is thinking that the given number
of tuples will be fetched at once.".

> If we do prefetching, I think it would be better that it’s the
> responsibility of the FDW to do prefetching, and I think that that
> could be done by letting the FDW to start another data fetch,
> independently of the core, in the ForeignAsyncNotify callback routine,

FDW does prefetching (if it means sending request to remote) in my
patch, so I agree to that.  It suspect that you were intended to say
the opposite.  The core (ExecAppendAsyncGetNext()) controls
prefetching in your patch.

> which I revived from Robert's original patch.  I think that that would
> be more efficient, because the FDW would no longer need to wait until
> all buffered tuples are returned to the core.  In the WIP patch, I

I don't understand. My patch sends a prefetch-query as soon as all the
tuples of the last remote-request is stored into FDW storage.  The
reason for removing ExecAsyncNotify() was it is just redundant as far
as concerning Append asynchrony. But I particulary oppose to revive
the function.

> only allowed the callback routine to put the corresponding ForeignScan
> node into a state where it’s either ready for a new request or needing
> a callback for another data fetch, but I think we could probably relax
> the restriction so that the ForeignScan node can be put into another
> state where it’s ready for a new request while needing a callback for
> the prefetch.

I don't understand this, too. ExecAsyncNotify() doesn't touch any of
the bitmaps, as_needrequest, callback_pending nor as_asyncpending in
your patch.  Am I looking into something wrong?  I'm looking
async-wip-2020-11-17.patch.

(By the way, it is one of those that make the code hard to read to me
that the "callback" means "calling an API function".  I think none of
them (ExecAsyncBegin, ExecAsyncRequest, ExecAsyncNotify) are a
"callback".)

> > > tuple is retrieved from an async-aware child *when* it is needed, like
> > > Thomas' patch.  I used FDW callback functions proposed by Robert, but
> > > introduced another FDW callback function ForeignAsyncBegin() for each
> > > async-aware child to start an asynchronous data fetch at the first
> > > call to ExecAppend() after ExecInitAppend() or ExecReScanAppend().
> >
> > Even though the terminology is not officially determined, in the past
> > discussions "async-aware" meant "can handle async-capable subnodes"
> > and "async-capable" is used as "can run asynchronously".
> 
> Thanks for the explanation!
> 
> > Likewise you
> > seem to have changed the meaning of as_needrequest from "subnodes that
> > needs to request for the next tuple" to "subnodes that already have
> > got query-send request and waiting for the result to come".
> 
> No.  I think I might slightly change the original definition of
> as_needrequest, though.

Mmm, sorry. I may have been perplexed by the comment below, which is
also added to ExecAsyncNotify().

ExecAppendAsyncRequest:
>        Assert(bms_is_member(i, node->as_needrequest));
>
>        /* Perform the actual callback. */
>        ExecAsyncRequest(areq);
>        if (ExecAppendAsyncResponse(areq))
>        {
>            Assert(!TupIsNull(areq->result));
>            *result = areq->result;
>            return true;
>        }



> > I would
> > argue to use the words and variables (names) in such meanings.
> 
> I think the word "aware" has a broader meaning, so the naming as
> proposed would be OK IMO.  But actually, I don't have any strong
> opinion about that, so I'll change it as explained.

Thanks.

> > > * For EvalPlanQual, I modified the patch so that async-aware children
> > > are treated as if they were synchronous when executing EvalPlanQual.
> >
> > Doesn't async execution accelerate the epq-fetching?  Or does
> > async-execution goes into trouble in the EPQ path?
> 
> The reason why I disabled async execution when executing EPQ is to
> avoid sending asynchronous queries to the remote sides, which would be
> useless, because scan tuples for an EPQ recheck are obtained in a
> dedicated way.

If EPQ is performed onto Append, I think it should gain from
asynchronous execution since it is going to fetch *a* tuple from
several partitions or children.  I believe EPQ doesn't contain Append
in major cases, though.  (Or I didn't come up with the steps for the
case to happen...)


> > > * In Robert's patch, all async-aware children below Append nodes in
> > > the query waiting for events to occur were managed by a single EState,
> > > but I modified the patch so that such children are managed by each
> > > Append node, like Horiguchi-san's patch and Thomas'.
> >
> > Managing in Estate give advantage for push-up style executor but
> > managing in node_state is simpler.
> 
> What do you mean by "push-up style executor"?

The reverse of the volcano-style executor, which enters from the
topmost node and down to the bottom.  In the "push-up stule executor",
the bottom-most nodes fires by a certain trigger then every
intermediate nodes throws up the result to the parent until reaching
the topmost node.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
At Sat, 12 Dec 2020 19:06:51 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in 
> On Fri, Nov 20, 2020 at 8:16 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > I looked through the nodeAppend.c and postgres_fdw.c part and those
> > are I think the core of this patch.
> 
> Thanks again for the review!
> 
> > -                * figure out which subplan we are currently processing
> > +                * try to get a tuple from async subplans
> > +                */
> > +               if (!bms_is_empty(node->as_needrequest) ||
> > +                       (node->as_syncdone && !bms_is_empty(node->as_asyncpending)))
> > +               {
> > +                       if (ExecAppendAsyncGetNext(node, &result))
> > +                               return result;
> >
> > The function ExecAppendAsyncGetNext() is a function called only here,
> > and contains only 31 lines.  It doesn't seem to me that the separation
> > makes the code more readable.
> 
> Considering the original ExecAppend() is about 50 lines long, 31 lines
> of code would not be small.  So I'd vote for separating it into
> another function as proposed.

Ok, I no longer oppose to separating some part from ExecAppend().

> > -               /* choose new subplan; if none, we're done */
> > -               if (!node->choose_next_subplan(node))
> > +               /* wait or poll async events */
> > +               if (!bms_is_empty(node->as_asyncpending))
> > +               {
> > +                       Assert(!node->as_syncdone);
> > +                       Assert(bms_is_empty(node->as_needrequest));
> > +                       ExecAppendAsyncEventWait(node);
> >
> > You moved the function to wait for events from execAsync to
> > nodeAppend. The former is a generic module that can be used from any
> > kind of executor nodes, but the latter is specialized for nodeAppend.
> > In other words, the abstraction level is lowered here.  What is the
> > reason for the change?
> 
> The reason is just because that function is only called from
> ExecAppend().  I put some functions only called from nodeAppend.c in
> execAsync.c, though.

(I think) You told me that you preferred the genericity of the
original interface, but you're doing the opposite.  If you think we
can move such a generic feature to a part of Append node, all other
features can be move the same way.  I guess there's a reason you want
only the this specific feature out of all of them be Append-spcific
and I want to know that.

> > +               /* Perform the actual callback. */
> > +               ExecAsyncRequest(areq);
> > +               if (ExecAppendAsyncResponse(areq))
> > +               {
> > +                       Assert(!TupIsNull(areq->result));
> > +                       *result = areq->result;
> >
> > Putting aside the name of the functions, the first two function are
> > used only this way at only two places. ExecAsyncRequest(areq) tells
> > fdw to store the first tuple among the already received ones to areq,
> > and ExecAppendAsyncResponse(areq) is checking the result is actually
> > set. Finally the result is retrieved directory from areq->result.
> > What is the reason that the two functions are separately exists?
> 
> I think that when an async-aware node gets a tuple from an
> async-capable node, they should use ExecAsyncRequest() /
> ExecAyncHogeResponse() rather than ExecProcNode() [1].  I modified the
> patch so that ExecAppendAsyncResponse() is called from Append, but to
> support bubbling up the plan tree discussed in [2], I think it should
> be called from ForeignScans (the sides of async-capable nodes).  Am I
> right?  Anyway, I’ll rename ExecAppendAyncResponse() to the one
> proposed in Robert’s original patch.

Even though I understand the concept but to make work it we need to
remember the parent *async* node somewhere. In my faint memory the
very early patch did something like that.

So I think just providing ExecAsyncResponse() doesn't make it
true. But if we make it true, it would be something like
partially-reversed steps from what the current Exec*()s do for some of
the existing nodes and further code is required for some other nodes
like WindowFunction. Bubbling up works only in very simple cases where
a returned tuple is thrown up to further parent as-is or at least when
the node convers a tuple into another shape. If an async-receiver node
wants to process multiple tuples from a child or from multiple
children, it is no longer be just a bubbling up.

That being said, we could avoid passing (a-kind-of) side-channel
information when ExecProcNode is called by providing
ExecAsyncResponse(). But I don't think the "side-channel" is not a
problem since it is just another state of the node.


And.. I think the reason I feel uneasy for the patch may be that the
patch uses the interface names in somewhat different context.
Origianlly the fraemework resides in-between executor nodes, not on a
node of either side. ExecAsyncNotify() notifies the requestee about an
event and ExecAsyncResonse() notifies the requestor about a new
tuple. I don't feel strangeness in this usage. But this patch feels to
me using the same names in different (and somewhat wrong) context.

> > +                       /* Perform the actual callback. */
> > +                       ExecAsyncNotify(areq);
> >
> > Mmm. The usage of the function (or its name) looks completely reverse
> > to me.  I think FDW should NOTIFY to exec nodes that the new tuple
> > gets available but the reverse is nonsense. What the function is
> > actually doing is to REQUEST fdw to fetch tuples that are expected to
> > have arrived, which is different from what the name suggests.
> 
> As mentioned in a previous email, this is an FDW callback routine
> revived from Robert’s patch.  I think the naming is reasonable,
> because the callback routine notifies the FDW of readiness of a file
> descriptor.  And actually, the callback routine tells the core whether
> the corresponding ForeignScan node is ready for a new request or not,
> by setting the callback_pending flag accordingly.

Hmm. Agreed.  The word "callback" is also used there [3]...  I
remember and it seems reasonable that the core calls AsyncNotify() on
FDW and the FDW calls ExecForeignScan as a response to it and notify
back to core of that using ExecAsyncRequestDone().  But the patch here
feels a little strange, or uneasy, to me.

[3] https://www.postgresql.org/message-id/20161018.103051.30820907.horiguchi.kyotaro%40lab.ntt.co.jp

> > postgres_fdw.c
> >
> > > postgresIterateForeignScan(ForeignScanState *node)
> > > {
> > >       PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
> > >       TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
> > >
> > >       /*
> > >        * If this is the first call after Begin or ReScan, we need to create the
> > >        * cursor on the remote side.
> > >        */
> > >       if (!fsstate->cursor_exists)
> > >               create_cursor(node);
> >
> > With the patch, cursors are also created in another place so at least
> > the comment is wrong.
> 
> Good catch!  Will fix.
> 
> > That being said, I think we should unify the
> > code except the differences between async and sync.  For example, if
> > the fetch_more_data_begin() needs to be called only for async
> > fetching, the cursor should be created before calling the function, in
> > the code path common with sync fetching.
> 
> I think that that would make the code easier to understand, but I’m
> not 100% sure we really need to do so.

And I believe that we don't tolerate even the slightest performance
degradation.

> Best regards,
> Etsuro Fujita
> 
> [1]
https://www.postgresql.org/message-id/CA%2BTgmoYrbgTBnLwnr1v%3Dpk%2BC%3DznWg7AgV9%3DM9ehrq6TDexPQNw%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/CA%2BTgmoZSWnhy%3DSB3ggQcB6EqKxzbNeNn%3DEfwARnCS5tyhhBNcw%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Mon, Dec 14, 2020 at 4:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Sat, 12 Dec 2020 18:25:57 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
> > On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > The reason for
> > > the early fetching is letting fdw send the next request as early as
> > > possible. (However, I didn't measure the effect of the
> > > nodeAppend-level prefetching.)
> >
> > I agree that that would lead to an improved efficiency in some cases,
> > but I still think that that would be useless in some other cases like
> > SELECT * FROM sharded_table LIMIT 1.  Also, I think the situation
> > would get worse if we support Append on top of joins or aggregates
> > over ForeignScans, which would be more expensive to perform than these
> > ForeignScans.
>
> I'm not sure which gain we weigh on, but if doing "LIMIT 1" on Append
> for many times is more common than fetching all or "LIMIT <many
> multiples of fetch_size>", that discussion would be convincing... Is
> it really the case?

I don't have a clear answer for that...  Performance in the case you
mentioned would be improved by async execution without prefetching by
Append, so it seemed reasonable to me to remove that prefetching to
avoid unnecessary overheads in the case I mentioned.  BUT: I started
to think my proposal, which needs an additional FDW callback routine
(ie, ForeignAsyncBegin()), might be a bad idea, because it would
increase the burden on FDW authors.

> > If we do prefetching, I think it would be better that it’s the
> > responsibility of the FDW to do prefetching, and I think that that
> > could be done by letting the FDW to start another data fetch,
> > independently of the core, in the ForeignAsyncNotify callback routine,
>
> FDW does prefetching (if it means sending request to remote) in my
> patch, so I agree to that.  It suspect that you were intended to say
> the opposite.  The core (ExecAppendAsyncGetNext()) controls
> prefetching in your patch.

No.  That function just tries to retrieve a tuple from any of the
ready subplans (ie, subplans marked as as_needrequest).

> > which I revived from Robert's original patch.  I think that that would
> > be more efficient, because the FDW would no longer need to wait until
> > all buffered tuples are returned to the core.  In the WIP patch, I
>
> I don't understand. My patch sends a prefetch-query as soon as all the
> tuples of the last remote-request is stored into FDW storage.  The
> reason for removing ExecAsyncNotify() was it is just redundant as far
> as concerning Append asynchrony. But I particulary oppose to revive
> the function.

Sorry, my explanation was not good, but what I'm saying here is about
my patch, not your patch.  I think this FDW callback routine would be
useful; it allows an FDW to perform another asynchronous data fetch
before delivering a tuple to the core as discussed in [1].  Also, it
would be useful when extending to the case where we have intermediate
nodes between an Append and a ForeignScan such as joins or aggregates,
which I'll explain below.

> > only allowed the callback routine to put the corresponding ForeignScan
> > node into a state where it’s either ready for a new request or needing
> > a callback for another data fetch, but I think we could probably relax
> > the restriction so that the ForeignScan node can be put into another
> > state where it’s ready for a new request while needing a callback for
> > the prefetch.
>
> I don't understand this, too. ExecAsyncNotify() doesn't touch any of
> the bitmaps, as_needrequest, callback_pending nor as_asyncpending in
> your patch.  Am I looking into something wrong?  I'm looking
> async-wip-2020-11-17.patch.

In the WIP patch I post, these bitmaps are modified in the core side
based on the callback_pending and request_complete flags in
AsyncRequests returned from FDWs (See ExecAppendAsyncEventWait()).

> (By the way, it is one of those that make the code hard to read to me
> that the "callback" means "calling an API function".  I think none of
> them (ExecAsyncBegin, ExecAsyncRequest, ExecAsyncNotify) are a
> "callback".)

I thought the word “callback” was OK, because these functions would
call the corresponding FDW callback routines, but I’ll revise the
wording.

> > The reason why I disabled async execution when executing EPQ is to
> > avoid sending asynchronous queries to the remote sides, which would be
> > useless, because scan tuples for an EPQ recheck are obtained in a
> > dedicated way.
>
> If EPQ is performed onto Append, I think it should gain from
> asynchronous execution since it is going to fetch *a* tuple from
> several partitions or children.  I believe EPQ doesn't contain Append
> in major cases, though.  (Or I didn't come up with the steps for the
> case to happen...)

Sorry, I don’t understand this part.  Could you elaborate a bit more on it?

> > What do you mean by "push-up style executor"?
>
> The reverse of the volcano-style executor, which enters from the
> topmost node and down to the bottom.  In the "push-up stule executor",
> the bottom-most nodes fires by a certain trigger then every
> intermediate nodes throws up the result to the parent until reaching
> the topmost node.

That is what I'm thinking to be able to support the case I mentioned
above.  I think that that would allow us to find ready subplans
efficiently from occurred wait events in ExecAppendAsyncEventWait().
Consider a plan like this:

Append
-> Nested Loop
  -> Foreign Scan on a
  -> Foreign Scan on b
-> ...

I assume here that Foreign Scan on a, Foreign Scan on b, and Nested
Loop are all async-capable and that we have somewhere in the executor
an AsyncRequest with requestor="Nested Loop" and requestee="Foreign
Scan on a", an AsyncRequest with requestor="Nested Loop" and
requestee="Foreign Scan on b", and an AsyncRequest with
requestor="Append" and requestee="Nested Loop".  In
ExecAppendAsyncEventWait(), if a file descriptor for foreign table a
becomes ready, we would call ForeignAsyncNotify() for a, and if it
returns a tuple back to the requestor node (ie, Nested Loop) (using
ExecAsyncResponse()), then *ForeignAsyncNotify() would be called for
Nested Loop*.  Nested Loop would then call ExecAsyncRequest() for the
inner requestee node (ie, Foreign Scan on b; I assume here that it is
a foreign scan parameterized by a).  If Foreign Scan on b returns a
tuple back to the requestor node (ie, Nested Loop) (using
ExecAsyncResponse()), then Nested Loop would match the tuples from the
outer and inner sides.  If they match, the join result would be
returned back to the requestor node (ie, Append) (using
ExecAsyncResponse()), marking the Nested Loop subplan as
as_needrequest.  Otherwise, Nested Loop would call ExecAsyncRequest()
for the inner requestee node for the next tuple, and so on.  If
ExecAsyncRequest() can't return a tuple immediately, we would wait
until a file descriptor for foreign table b becomes ready; we would
start from calling ForeignAsyncNotify() for b when the file descriptor
becomes ready.  In this way we could find ready subplans efficiently
from occurred wait events in ExecAppendAsyncEventWait() when extending
to the case where subplans are joins or aggregates over Foreign Scans,
I think.  Maybe I’m missing something, though.

Thanks for the comments!

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK153oorYtTpW_-aZrjH-iecHbykX7qbxX_5630ZK8nqVHg%40mail.gmail.com



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Mon, Dec 14, 2020 at 5:56 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Sat, 12 Dec 2020 19:06:51 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
> > On Fri, Nov 20, 2020 at 8:16 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:

> > > +               /* wait or poll async events */
> > > +               if (!bms_is_empty(node->as_asyncpending))
> > > +               {
> > > +                       Assert(!node->as_syncdone);
> > > +                       Assert(bms_is_empty(node->as_needrequest));
> > > +                       ExecAppendAsyncEventWait(node);
> > >
> > > You moved the function to wait for events from execAsync to
> > > nodeAppend. The former is a generic module that can be used from any
> > > kind of executor nodes, but the latter is specialized for nodeAppend.
> > > In other words, the abstraction level is lowered here.  What is the
> > > reason for the change?
> >
> > The reason is just because that function is only called from
> > ExecAppend().  I put some functions only called from nodeAppend.c in
> > execAsync.c, though.
>
> (I think) You told me that you preferred the genericity of the
> original interface, but you're doing the opposite.  If you think we
> can move such a generic feature to a part of Append node, all other
> features can be move the same way.  I guess there's a reason you want
> only the this specific feature out of all of them be Append-spcific
> and I want to know that.

The reason is that I’m thinking to add a small feature for
multiplexing Append subplans, not a general feature for async
execution as discussed in [1], because this would be an interim
solution until the executor rewrite is done.

> > I think that when an async-aware node gets a tuple from an
> > async-capable node, they should use ExecAsyncRequest() /
> > ExecAyncHogeResponse() rather than ExecProcNode() [1].  I modified the
> > patch so that ExecAppendAsyncResponse() is called from Append, but to
> > support bubbling up the plan tree discussed in [2], I think it should
> > be called from ForeignScans (the sides of async-capable nodes).  Am I
> > right?  Anyway, I’ll rename ExecAppendAyncResponse() to the one
> > proposed in Robert’s original patch.
>
> Even though I understand the concept but to make work it we need to
> remember the parent *async* node somewhere. In my faint memory the
> very early patch did something like that.
>
> So I think just providing ExecAsyncResponse() doesn't make it
> true. But if we make it true, it would be something like
> partially-reversed steps from what the current Exec*()s do for some of
> the existing nodes and further code is required for some other nodes
> like WindowFunction. Bubbling up works only in very simple cases where
> a returned tuple is thrown up to further parent as-is or at least when
> the node convers a tuple into another shape. If an async-receiver node
> wants to process multiple tuples from a child or from multiple
> children, it is no longer be just a bubbling up.

I explained the meaning of “bubbling up the plan tree” in a previous
email I sent a moment ago.

> And.. I think the reason I feel uneasy for the patch may be that the
> patch uses the interface names in somewhat different context.
> Origianlly the fraemework resides in-between executor nodes, not on a
> node of either side. ExecAsyncNotify() notifies the requestee about an
> event and ExecAsyncResonse() notifies the requestor about a new
> tuple. I don't feel strangeness in this usage. But this patch feels to
> me using the same names in different (and somewhat wrong) context.

Sorry, this is a WIP patch.  Will fix.

> > > +                       /* Perform the actual callback. */
> > > +                       ExecAsyncNotify(areq);
> > >
> > > Mmm. The usage of the function (or its name) looks completely reverse
> > > to me.

> > As mentioned in a previous email, this is an FDW callback routine
> > revived from Robert’s patch.  I think the naming is reasonable,
> > because the callback routine notifies the FDW of readiness of a file
> > descriptor.  And actually, the callback routine tells the core whether
> > the corresponding ForeignScan node is ready for a new request or not,
> > by setting the callback_pending flag accordingly.
>
> Hmm. Agreed.  The word "callback" is also used there [3]...  I
> remember and it seems reasonable that the core calls AsyncNotify() on
> FDW and the FDW calls ExecForeignScan as a response to it and notify
> back to core of that using ExecAsyncRequestDone().  But the patch here
> feels a little strange, or uneasy, to me.

I’m not sure what I should do to improve the patch.  Could you
elaborate a bit more on this part?

> > > postgres_fdw.c
> > >
> > > > postgresIterateForeignScan(ForeignScanState *node)
> > > > {
> > > >       PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
> > > >       TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
> > > >
> > > >       /*
> > > >        * If this is the first call after Begin or ReScan, we need to create the
> > > >        * cursor on the remote side.
> > > >        */
> > > >       if (!fsstate->cursor_exists)
> > > >               create_cursor(node);

> > > That being said, I think we should unify the
> > > code except the differences between async and sync.  For example, if
> > > the fetch_more_data_begin() needs to be called only for async
> > > fetching, the cursor should be created before calling the function, in
> > > the code path common with sync fetching.
> >
> > I think that that would make the code easier to understand, but I’m
> > not 100% sure we really need to do so.
>
> And I believe that we don't tolerate even the slightest performance
> degradation.

In the case of async execution, the cursor would have already been
created before we get here as mentioned by you, so we would just skip
create_cursor() in that case.  I don’t think that that would degrade
performance noticeably.  Am I wrong?

Thanks again!

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CA%2BTgmobx8su_bYtAa3DgrqB%2BR7xZG6kHRj0ccMUUshKAQVftww%40mail.gmail.com



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Thu, Nov 26, 2020 at 10:28 AM movead.li@highgo.ca
<movead.li@highgo.ca> wrote:
> I test the patch and occur several issues as blow:

Thank you for the review!

> Issue one:
> Get a Assert error at 'Assert(bms_is_member(i, node->as_needrequest));' in
> ExecAppendAsyncRequest() function when I use more than two foreign table
> on different foreign server.
>
> I research the code and do such change then the Assert problom disappear.

Could you show a test case causing the assertion failure?

> Issue two:
> Then I test and find if I have sync subplan and async sunbplan, it will run over
> the sync subplan then the async turn, I do not know if it is intent.

Did you use a partitioned table with only two partitions where one is
local and the other is remote?  If so, that would be expected, because
in that case, 1) the patch would first send an asynchronous query to
the remote, 2) it would then process the local partition until the
end, 3) it would then wait/poll the async event, and 4) it would
finally process the remote partition when the event occurs.

Sorry for the delay.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Thu, Dec 10, 2020 at 3:38 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> On 11/17/20 2:56 PM, Etsuro Fujita wrote:
> > On Mon, Oct 5, 2020 at 3:35 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > Comments welcome!  The attached is still WIP and maybe I'm missing
> > something, though.
> I reviewed your patch and used it in my TPC-H benchmarks. It is still
> WIP. Will you improve this patch?

Yeah, will do.

> I also want to say that, in my opinion, Horiguchi-san's version seems
> preferable: it is more structured, simple to understand, executor-native
> and allows to reduce FDW interface changes.

I’m not sure what you mean by “executor-native”, but I partly agree
that Horiguchi-san’s version would be easier to understand, because
his version was made so that a tuple is requested from an async
subplan using our Volcano Iterator model almost as-is.  But my
concerns about his version would be: 1) it’s actually pretty invasive,
because it changes the contract of the ExecProcNode() API [1], and 2)
IIUC it wouldn’t allow us to find ready subplans from occurred wait
events when we extend to the case where subplans are joins or
aggregates over ForeignScans [2].

> This code really only needs
> one procedure - IsForeignPathAsyncCapable.

This isn’t correct: his version uses ForeignAsyncConfigureWait() as well.

Thank you for reviewing!  Sorry for the delay.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK16YXCADSwsFLSxqTBBLbt3E_%3DiigKTtjS%3Ddqu%2B8K8DWCw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAPmGK16rA5ODyRrVK9iPsyW-td2RcRZXsdWoVhMmLLmUhprsTg%40mail.gmail.com



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Sat, Dec 19, 2020 at 5:55 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Mon, Dec 14, 2020 at 4:01 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > At Sat, 12 Dec 2020 18:25:57 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
> > > On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi
> > > <horikyota.ntt@gmail.com> wrote:
> > > > The reason for
> > > > the early fetching is letting fdw send the next request as early as
> > > > possible. (However, I didn't measure the effect of the
> > > > nodeAppend-level prefetching.)
> > >
> > > I agree that that would lead to an improved efficiency in some cases,
> > > but I still think that that would be useless in some other cases like
> > > SELECT * FROM sharded_table LIMIT 1.  Also, I think the situation
> > > would get worse if we support Append on top of joins or aggregates
> > > over ForeignScans, which would be more expensive to perform than these
> > > ForeignScans.
> >
> > I'm not sure which gain we weigh on, but if doing "LIMIT 1" on Append
> > for many times is more common than fetching all or "LIMIT <many
> > multiples of fetch_size>", that discussion would be convincing... Is
> > it really the case?
>
> I don't have a clear answer for that...  Performance in the case you
> mentioned would be improved by async execution without prefetching by
> Append, so it seemed reasonable to me to remove that prefetching to
> avoid unnecessary overheads in the case I mentioned.  BUT: I started
> to think my proposal, which needs an additional FDW callback routine
> (ie, ForeignAsyncBegin()), might be a bad idea, because it would
> increase the burden on FDW authors.

I dropped my proposal; I modified the patch so that ExecAppend()
requests tuples from all subplans needing a request *at once*, as
originally proposed by Robert and then you.  Please find attached a
new version of the patch.

Other changes:

* I renamed ExecAppendAsyncResponse() to what was originally proposed
by Robert, and modified the patch so that that function is called from
the requestee side, not the requestor side as in the previous version.

* I renamed the variable async_aware as explained by you.

* I tweaked comments a bit to address your comments.

* I made code simpler, and added a bit more assertions.

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Thu, Dec 31, 2020 at 7:15 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> * I tweaked comments a bit to address your comments.

I forgot to update some comments.  :-(  Attached is a new version of
the patch updating comments further.  I did a bit of cleanup for the
postgres_fdw part as well.

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Sun, Dec 20, 2020 at 5:15 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Thu, Nov 26, 2020 at 10:28 AM movead.li@highgo.ca
> <movead.li@highgo.ca> wrote:
> > Issue one:
> > Get a Assert error at 'Assert(bms_is_member(i, node->as_needrequest));' in
> > ExecAppendAsyncRequest() function when I use more than two foreign table
> > on different foreign server.
> >
> > I research the code and do such change then the Assert problom disappear.
>
> Could you show a test case causing the assertion failure?

I happened to reproduce the same failure in my environment.

I think your change would be correct, but I changed the patch so that
it doesn’t need as_lastasyncplan anymore [1].  The new version of the
patch works well for my case.  So, could you test your case with it?

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK17L0j6otssa53ZvjnCsjguJHZXaqPL2HU_LDoZ4ATZjEw%40mail.gmail.com



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
At Sat, 19 Dec 2020 17:55:22 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in 
> On Mon, Dec 14, 2020 at 4:01 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > At Sat, 12 Dec 2020 18:25:57 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
> > > On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi
> > > <horikyota.ntt@gmail.com> wrote:
> > > > The reason for
> > > > the early fetching is letting fdw send the next request as early as
> > > > possible. (However, I didn't measure the effect of the
> > > > nodeAppend-level prefetching.)
> > >
> > > I agree that that would lead to an improved efficiency in some cases,
> > > but I still think that that would be useless in some other cases like
> > > SELECT * FROM sharded_table LIMIT 1.  Also, I think the situation
> > > would get worse if we support Append on top of joins or aggregates
> > > over ForeignScans, which would be more expensive to perform than these
> > > ForeignScans.
> >
> > I'm not sure which gain we weigh on, but if doing "LIMIT 1" on Append
> > for many times is more common than fetching all or "LIMIT <many
> > multiples of fetch_size>", that discussion would be convincing... Is
> > it really the case?
> 
> I don't have a clear answer for that...  Performance in the case you
> mentioned would be improved by async execution without prefetching by
> Append, so it seemed reasonable to me to remove that prefetching to
> avoid unnecessary overheads in the case I mentioned.  BUT: I started
> to think my proposal, which needs an additional FDW callback routine
> (ie, ForeignAsyncBegin()), might be a bad idea, because it would
> increase the burden on FDW authors.

I agree on the point of developers' burden.

> > > If we do prefetching, I think it would be better that it’s the
> > > responsibility of the FDW to do prefetching, and I think that that
> > > could be done by letting the FDW to start another data fetch,
> > > independently of the core, in the ForeignAsyncNotify callback routine,
> >
> > FDW does prefetching (if it means sending request to remote) in my
> > patch, so I agree to that.  It suspect that you were intended to say
> > the opposite.  The core (ExecAppendAsyncGetNext()) controls
> > prefetching in your patch.
> 
> No.  That function just tries to retrieve a tuple from any of the
> ready subplans (ie, subplans marked as as_needrequest).

Mmm. I meant that the function explicitly calls
ExecAppendAsyncRequest(), which finally calls fetch_more_data_begin()
(if needed). Conversely if the function dosn't call
ExecAppendAsyncRequsest, the next request to remote doesn't
happen. That is, after the tuple buffer of FDW-side is exhausted, the
next request doesn't happen until executor requests for the next
tuple. You seem to be saying that "postgresForeignAsyncRequest() calls
fetch_more_data_begin() following its own decision."  but this doesn't
seem to be "prefetching".

> > > which I revived from Robert's original patch.  I think that that would
> > > be more efficient, because the FDW would no longer need to wait until
> > > all buffered tuples are returned to the core.  In the WIP patch, I
> >
> > I don't understand. My patch sends a prefetch-query as soon as all the
> > tuples of the last remote-request is stored into FDW storage.  The
> > reason for removing ExecAsyncNotify() was it is just redundant as far
> > as concerning Append asynchrony. But I particulary oppose to revive
> > the function.
> 
> Sorry, my explanation was not good, but what I'm saying here is about
> my patch, not your patch.  I think this FDW callback routine would be
> useful; it allows an FDW to perform another asynchronous data fetch
> before delivering a tuple to the core as discussed in [1].  Also, it
> would be useful when extending to the case where we have intermediate
> nodes between an Append and a ForeignScan such as joins or aggregates,
> which I'll explain below.

Yeah. If a not-immediate parent of an async-capable node works as
async-aware, the notify API would have the power. So I don't object to
the API.

> > > only allowed the callback routine to put the corresponding ForeignScan
> > > node into a state where it’s either ready for a new request or needing
> > > a callback for another data fetch, but I think we could probably relax
> > > the restriction so that the ForeignScan node can be put into another
> > > state where it’s ready for a new request while needing a callback for
> > > the prefetch.
> >
> > I don't understand this, too. ExecAsyncNotify() doesn't touch any of
> > the bitmaps, as_needrequest, callback_pending nor as_asyncpending in
> > your patch.  Am I looking into something wrong?  I'm looking
> > async-wip-2020-11-17.patch.
> 
> In the WIP patch I post, these bitmaps are modified in the core side
> based on the callback_pending and request_complete flags in
> AsyncRequests returned from FDWs (See ExecAppendAsyncEventWait()).

Sorry. I think I misread you here. I agree that, the notify API is not
so useful now but would be useful if we allow notify descendents other
than immediate children. However, I stumbled on the fact that some
kinds of node doesn't return a result when all the underlying nodes
returned *a* tuple. Concretely count(*) doesn't return after *all*
tuple of the counted relation has been returned.  I remember that the
fact might be the reason why I removed the API.  After all the topmost
async-aware node must ask every immediate child if it can return a
tuple.

> > (By the way, it is one of those that make the code hard to read to me
> > that the "callback" means "calling an API function".  I think none of
> > them (ExecAsyncBegin, ExecAsyncRequest, ExecAsyncNotify) are a
> > "callback".)
> 
> I thought the word “callback” was OK, because these functions would
> call the corresponding FDW callback routines, but I’ll revise the
> wording.

I'm not confident on the usage of "callback", though:p (Sorry.) I
believe that "callback" is a function a caller tells a callee to call
it.  In broader meaning, all FDW APIs are a function that an FDW
extention tells the core to call it (yeah, that's inversed.). However,
we don't call fread a callback of libc. They work based on slightly
different mechanism but substantially the same, I think.

> > > The reason why I disabled async execution when executing EPQ is to
> > > avoid sending asynchronous queries to the remote sides, which would be
> > > useless, because scan tuples for an EPQ recheck are obtained in a
> > > dedicated way.
> >
> > If EPQ is performed onto Append, I think it should gain from
> > asynchronous execution since it is going to fetch *a* tuple from
> > several partitions or children.  I believe EPQ doesn't contain Append
> > in major cases, though.  (Or I didn't come up with the steps for the
> > case to happen...)
> 
> Sorry, I don’t understand this part.  Could you elaborate a bit more on it?

EPQ retrieves a specific tuple from a node. If we perform EPQ on an
Append, only one of the children should offer a result tuple.  Since
Append has no idea of which of its children will offer a result, it
has no way other than asking all children until it receives a
result. If we do that, asynchronously sending a query to all nodes
would win.


> > > What do you mean by "push-up style executor"?
> >
> > The reverse of the volcano-style executor, which enters from the
> > topmost node and down to the bottom.  In the "push-up stule executor",
> > the bottom-most nodes fires by a certain trigger then every
> > intermediate nodes throws up the result to the parent until reaching
> > the topmost node.
> 
> That is what I'm thinking to be able to support the case I mentioned
> above.  I think that that would allow us to find ready subplans
> efficiently from occurred wait events in ExecAppendAsyncEventWait().
> Consider a plan like this:
> 
> Append
> -> Nested Loop
>   -> Foreign Scan on a
>   -> Foreign Scan on b
> -> ...
> 
> I assume here that Foreign Scan on a, Foreign Scan on b, and Nested
> Loop are all async-capable and that we have somewhere in the executor
> an AsyncRequest with requestor="Nested Loop" and requestee="Foreign
> Scan on a", an AsyncRequest with requestor="Nested Loop" and
> requestee="Foreign Scan on b", and an AsyncRequest with
> requestor="Append" and requestee="Nested Loop".  In
> ExecAppendAsyncEventWait(), if a file descriptor for foreign table a
> becomes ready, we would call ForeignAsyncNotify() for a, and if it
> returns a tuple back to the requestor node (ie, Nested Loop) (using
> ExecAsyncResponse()), then *ForeignAsyncNotify() would be called for
> Nested Loop*.  Nested Loop would then call ExecAsyncRequest() for the
> inner requestee node (ie, Foreign Scan on b; I assume here that it is
> a foreign scan parameterized by a).  If Foreign Scan on b returns a
> tuple back to the requestor node (ie, Nested Loop) (using
> ExecAsyncResponse()), then Nested Loop would match the tuples from the
> outer and inner sides.  If they match, the join result would be
> returned back to the requestor node (ie, Append) (using
> ExecAsyncResponse()), marking the Nested Loop subplan as
> as_needrequest.  Otherwise, Nested Loop would call ExecAsyncRequest()
> for the inner requestee node for the next tuple, and so on.  If
> ExecAsyncRequest() can't return a tuple immediately, we would wait
> until a file descriptor for foreign table b becomes ready; we would
> start from calling ForeignAsyncNotify() for b when the file descriptor
> becomes ready.  In this way we could find ready subplans efficiently
> from occurred wait events in ExecAppendAsyncEventWait() when extending
> to the case where subplans are joins or aggregates over Foreign Scans,
> I think.  Maybe I’m missing something, though.

Maybe so. As I mentioned above, in the follwoing case..

  Join -1
    Join -2
      ForegnScan -A
      ForegnScan -B
    ForegnScan -C

Where the Join-1 is the leader of asynchronous fetching. Even if both
of the FS-A,B have returned one tuple each, it's unsure that Join-2
returns a tuple. I'm not sure how to resolve the situation with the
current infrastructure as-is.

So I tried a structure where when a node gets a new tuple, the node
asks the parent whether it is satisfied or not. In that trial I needed
to make every execnodes a state machine and that was pretty messy..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, Jan 15, 2021 at 4:54 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> Mmm. I meant that the function explicitly calls
> ExecAppendAsyncRequest(), which finally calls fetch_more_data_begin()
> (if needed). Conversely if the function dosn't call
> ExecAppendAsyncRequsest, the next request to remote doesn't
> happen. That is, after the tuple buffer of FDW-side is exhausted, the
> next request doesn't happen until executor requests for the next
> tuple. You seem to be saying that "postgresForeignAsyncRequest() calls
> fetch_more_data_begin() following its own decision."  but this doesn't
> seem to be "prefetching".

Let me explain a bit more.  Actually, the new version of the patch
allows prefetching in the FDW side; for such prefetching in
postgres_fdw, I think we could add a fetch_more_data_begin() call in
postgresForeignAsyncNotify().  But I left that for future work,
because we don’t know yet if that’s really useful.  (Another reason
why I left that is we have more important issues that should be
addressed [1], and I think addressing those issues is a requirement
for us to commit this patch, but adding such prefetching isn’t, IMO.)

> Sorry. I think I misread you here. I agree that, the notify API is not
> so useful now but would be useful if we allow notify descendents other
> than immediate children. However, I stumbled on the fact that some
> kinds of node doesn't return a result when all the underlying nodes
> returned *a* tuple. Concretely count(*) doesn't return after *all*
> tuple of the counted relation has been returned.  I remember that the
> fact might be the reason why I removed the API.  After all the topmost
> async-aware node must ask every immediate child if it can return a
> tuple.

The patch I posted, which revived Robert’s original patch using stuff
from your patch and Thomas’, provides ExecAsyncRequest() as well as
ExecAsyncNotify(), which supports pull-based execution like
ExecProcNode() (while ExecAsyncNotify() supports push-based
execution.)  In the aggregate case you mentioned, I think we could
iterate calling ExecAsyncRequest() for the underlying subplan to get
all tuples from it, in a similar way to ExecProcNode() in the normal
case.

> EPQ retrieves a specific tuple from a node. If we perform EPQ on an
> Append, only one of the children should offer a result tuple.  Since
> Append has no idea of which of its children will offer a result, it
> has no way other than asking all children until it receives a
> result. If we do that, asynchronously sending a query to all nodes
> would win.

Thanks for the explanation!  But I’m still not sure why we need to
send an asynchronous query to each of the asynchronous nodes in an EPQ
recheck.  Is it possible to explain a bit more about that?

I wrote:
> > That is what I'm thinking to be able to support the case I mentioned
> > above.  I think that that would allow us to find ready subplans
> > efficiently from occurred wait events in ExecAppendAsyncEventWait().
> > Consider a plan like this:
> >
> > Append
> > -> Nested Loop
> >   -> Foreign Scan on a
> >   -> Foreign Scan on b
> > -> ...
> >
> > I assume here that Foreign Scan on a, Foreign Scan on b, and Nested
> > Loop are all async-capable and that we have somewhere in the executor
> > an AsyncRequest with requestor="Nested Loop" and requestee="Foreign
> > Scan on a", an AsyncRequest with requestor="Nested Loop" and
> > requestee="Foreign Scan on b", and an AsyncRequest with
> > requestor="Append" and requestee="Nested Loop".  In
> > ExecAppendAsyncEventWait(), if a file descriptor for foreign table a
> > becomes ready, we would call ForeignAsyncNotify() for a, and if it
> > returns a tuple back to the requestor node (ie, Nested Loop) (using
> > ExecAsyncResponse()), then *ForeignAsyncNotify() would be called for
> > Nested Loop*.  Nested Loop would then call ExecAsyncRequest() for the
> > inner requestee node (ie, Foreign Scan on b; I assume here that it is
> > a foreign scan parameterized by a).  If Foreign Scan on b returns a
> > tuple back to the requestor node (ie, Nested Loop) (using
> > ExecAsyncResponse()), then Nested Loop would match the tuples from the
> > outer and inner sides.  If they match, the join result would be
> > returned back to the requestor node (ie, Append) (using
> > ExecAsyncResponse()), marking the Nested Loop subplan as
> > as_needrequest.  Otherwise, Nested Loop would call ExecAsyncRequest()
> > for the inner requestee node for the next tuple, and so on.  If
> > ExecAsyncRequest() can't return a tuple immediately, we would wait
> > until a file descriptor for foreign table b becomes ready; we would
> > start from calling ForeignAsyncNotify() for b when the file descriptor
> > becomes ready.  In this way we could find ready subplans efficiently
> > from occurred wait events in ExecAppendAsyncEventWait() when extending
> > to the case where subplans are joins or aggregates over Foreign Scans,
> > I think.  Maybe I’m missing something, though.

> Maybe so. As I mentioned above, in the follwoing case..
>
>   Join -1
>     Join -2
>       ForegnScan -A
>       ForegnScan -B
>         ForegnScan -C
>
> Where the Join-1 is the leader of asynchronous fetching. Even if both
> of the FS-A,B have returned one tuple each, it's unsure that Join-2
> returns a tuple. I'm not sure how to resolve the situation with the
> current infrastructure as-is.

Maybe my explanation was not good, so let me explain a bit more.
Assume that Join-2 is a nested loop join as shown above.  If the
tuples from the outer/inner sides didn’t match, we could iterate
calling *ExecAsyncRequest()* for the inner side until a matched tuple
from it is found.  If the inner side wasn’t able to return a tuple
immediately, 1) it would return request_complete=false to Join-2 using
ExecAsyncResponse(), and 2) we could wait for a file descriptor for
the inner side to become ready (while processing other part of the
Append tree), and 3) when the file descriptor becomes ready, recursive
ExecAsyncNotify() calls would restart the Join-2 processing in a
push-based manner as explained above.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK14xrGe%2BXks7%2BfVLBoUUbKwcDkT9km1oFXhdY%2BFFhbMjUg%40mail.gmail.com



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Tue, Nov 17, 2020 at 6:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> * I haven't yet done anything about the issue on postgres_fdw's
> handling of concurrent data fetches by multiple ForeignScan nodes
> (below *different* Append nodes in the query) using the same
> connection discussed in [2].  I modified the patch to just disable
> applying this feature to problematic test cases in the postgres_fdw
> regression tests, by a new GUC enable_async_append.

A solution for the issue would be a scheduler designed to handle such
data fetches more efficiently, but I don’t think it’s easy to create
such a scheduler.  Rather than doing so, I'd like to propose to allow
FDWs to disable async execution of them in problematic cases by
themselves during executor startup in the first cut.  What I have in
mind for that is:

1) For an FDW that has async-capable ForeignScan(s), we allow the FDW
to record, for each of the async-capable and non-async-capable
ForeignScan(s), the information on a connection to be used for the
ForeignScan into EState during BeginForeignScan().

2) After doing ExecProcNode() to each SubPlan and the main query tree
in InitPlan(), we give the FDW a chance to a) reconsider, for each of
the async-capable ForeignScan(s), whether the ForeignScan can be
executed asynchronously as planned, based on the information stored
into EState in #1, and then b) disable async execution of the
ForeignScan if not.

#1 and #2 would be done after initial partition pruning, so more
async-capable ForeignScans would be executed asynchronously, if other
async-capable ForeignScans conflicting with them are removed by that
pruning.

This wouldn’t prevent us from adding a feature like what was proposed
by Horiguchi-san later.

BTW: while considering this, I noticed some bugs with
ExecAppendAsyncBegin() in the previous patch.  Attached is a new
version of the patch fixing them.  I also tweaked some comments a
little bit.

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Mon, Feb 1, 2021 at 12:06 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Tue, Nov 17, 2020 at 6:56 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > * I haven't yet done anything about the issue on postgres_fdw's
> > handling of concurrent data fetches by multiple ForeignScan nodes
> > (below *different* Append nodes in the query) using the same
> > connection discussed in [2].  I modified the patch to just disable
> > applying this feature to problematic test cases in the postgres_fdw
> > regression tests, by a new GUC enable_async_append.
>
> A solution for the issue would be a scheduler designed to handle such
> data fetches more efficiently, but I don’t think it’s easy to create
> such a scheduler.  Rather than doing so, I'd like to propose to allow
> FDWs to disable async execution of them in problematic cases by
> themselves during executor startup in the first cut.  What I have in
> mind for that is:
>
> 1) For an FDW that has async-capable ForeignScan(s), we allow the FDW
> to record, for each of the async-capable and non-async-capable
> ForeignScan(s), the information on a connection to be used for the
> ForeignScan into EState during BeginForeignScan().
>
> 2) After doing ExecProcNode() to each SubPlan and the main query tree
> in InitPlan(), we give the FDW a chance to a) reconsider, for each of
> the async-capable ForeignScan(s), whether the ForeignScan can be
> executed asynchronously as planned, based on the information stored
> into EState in #1, and then b) disable async execution of the
> ForeignScan if not.

s/ExecProcNode()/ExecInitNode()/.  Sorry for that.  I’ll post an
updated patch for this in a few days.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Thu, Feb 4, 2021 at 7:21 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Mon, Feb 1, 2021 at 12:06 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > Rather than doing so, I'd like to propose to allow
> > FDWs to disable async execution of them in problematic cases by
> > themselves during executor startup in the first cut.  What I have in
> > mind for that is:
> >
> > 1) For an FDW that has async-capable ForeignScan(s), we allow the FDW
> > to record, for each of the async-capable and non-async-capable
> > ForeignScan(s), the information on a connection to be used for the
> > ForeignScan into EState during BeginForeignScan().
> >
> > 2) After doing ExecProcNode() to each SubPlan and the main query tree
> > in InitPlan(), we give the FDW a chance to a) reconsider, for each of
> > the async-capable ForeignScan(s), whether the ForeignScan can be
> > executed asynchronously as planned, based on the information stored
> > into EState in #1, and then b) disable async execution of the
> > ForeignScan if not.
>
> s/ExecProcNode()/ExecInitNode()/.  Sorry for that.  I’ll post an
> updated patch for this in a few days.

I created a WIP patch for this.  For #2, I added a new callback
routine ReconsiderAsyncForeignScan().  The routine for postgres_fdw
postgresReconsiderAsyncForeignScan() is pretty simple: async execution
of an async-capable ForeignScan is disabled if the connection used for
it is used in other parts of the query plan tree except async subplans
just below the parent Append.  Here is a running example:

postgres=# create table t1 (a int, b int, c text);
postgres=# create table t2 (a int, b int, c text);
postgres=# create foreign table p1 (a int, b int, c text) server
server1 options (table_name 't1');
postgres=# create foreign table p2 (a int, b int, c text) server
server2 options (table_name 't2');
postgres=# create table pt (a int, b int, c text) partition by range (a);
postgres=# alter table pt attach partition p1 for values from (10) to (20);
postgres=# alter table pt attach partition p2 for values from (20) to (30);
postgres=# insert into p1 select 10 + i % 10, i, to_char(i, 'FM0000')
from generate_series(0, 99) i;
postgres=# insert into p2 select 20 + i % 10, i, to_char(i, 'FM0000')
from generate_series(0, 99) i;
postgres=# analyze pt;
postgres=# create table loct (a int, b int);
postgres=# create foreign table ft (a int, b int) server server1
options (table_name 'loct');
postgres=# insert into ft select i, i from generate_series(0, 99) i;
postgres=# analyze ft;
postgres=# create view v as select * from ft;

postgres=# explain verbose select * from pt, v where pt.b = v.b and v.b = 99;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Nested Loop  (cost=200.00..306.84 rows=2 width=21)
   Output: pt.a, pt.b, pt.c, ft.a, ft.b
   ->  Foreign Scan on public.ft  (cost=100.00..102.27 rows=1 width=8)
         Output: ft.a, ft.b
         Remote SQL: SELECT a, b FROM public.loct WHERE ((b = 99))
   ->  Append  (cost=100.00..204.55 rows=2 width=13)
         ->  Foreign Scan on public.p1 pt_1  (cost=100.00..102.27
rows=1 width=13)
               Output: pt_1.a, pt_1.b, pt_1.c
               Remote SQL: SELECT a, b, c FROM public.t1 WHERE ((b = 99))
         ->  Async Foreign Scan on public.p2 pt_2
(cost=100.00..102.27 rows=1 width=13)
               Output: pt_2.a, pt_2.b, pt_2.c
               Remote SQL: SELECT a, b, c FROM public.t2 WHERE ((b = 99))
(12 rows)

For this query, while p2 is executed asynchronously, p1 isn’t as it
uses the same connection with ft.  BUT:

postgres=# create role view_owner SUPERUSER;
postgres=# create user mapping for view_owner server server1;
postgres=# alter view v owner to view_owner;

postgres=# explain verbose select * from pt, v where pt.b = v.b and v.b = 99;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Nested Loop  (cost=200.00..306.84 rows=2 width=21)
   Output: pt.a, pt.b, pt.c, ft.a, ft.b
   ->  Foreign Scan on public.ft  (cost=100.00..102.27 rows=1 width=8)
         Output: ft.a, ft.b
         Remote SQL: SELECT a, b FROM public.loct WHERE ((b = 99))
   ->  Append  (cost=100.00..204.55 rows=2 width=13)
         ->  Async Foreign Scan on public.p1 pt_1
(cost=100.00..102.27 rows=1 width=13)
               Output: pt_1.a, pt_1.b, pt_1.c
               Remote SQL: SELECT a, b, c FROM public.t1 WHERE ((b = 99))
         ->  Async Foreign Scan on public.p2 pt_2
(cost=100.00..102.27 rows=1 width=13)
               Output: pt_2.a, pt_2.b, pt_2.c
               Remote SQL: SELECT a, b, c FROM public.t2 WHERE ((b = 99))
(12 rows)

in this setup, p1 is executed asynchronously as ft doesn’t use the
same connection with p1.

I added to postgresReconsiderAsyncForeignScan() this as well: even if
the connection isn’t used in the other parts, async execution of an
async-capable ForeignScan is disabled if the subplans of the Append
are all async-capable, and they use the same connection, because in
that case the subplans won’t be parallelized at all, and the overhead
of async execution may cause a performance degradation.

Attached is an updated version of the patch.  Sorry for the delay.

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Wed, Feb 10, 2021 at 7:31 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Attached is an updated version of the patch.  Sorry for the delay.

I noticed that I forgot to add new files.  :-(.  Please find attached
an updated patch.

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
At Wed, 10 Feb 2021 21:31:15 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in 
> On Wed, Feb 10, 2021 at 7:31 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > Attached is an updated version of the patch.  Sorry for the delay.
> 
> I noticed that I forgot to add new files.  :-(.  Please find attached
> an updated patch.

Thanks for the new version.

It seems too specific to async Append so I look it as a PoC of the
mechanism.

It creates a hash table that keyed by connection umid to record
planids run on the connection, triggerd by core planner via a dedicate
API function.  It seems to me that ConnCacheEntry.state can hold that
and the hash is not needed at all.

| postgresReconsiderAsyncForeignScan(ForeignScanState *node, AsyncContext *acxt)
| {
| ...
|     /*
|      * If the connection used for the ForeignScan node is used in other parts
|      * of the query plan tree except async subplans of the parent Append node,
|      * disable async execution of the ForeignScan node.
|      */
|     if (!bms_is_subset(fsplanids, asyncplanids))
|         return false;

This would be a reasonable restriction.

|     /*
|      * If the subplans of the Append node are all async-capable, and use the
|      * same connection, then we won't execute them asynchronously.
|      */
|     if (requestor->as_nasyncplans == requestor->as_nplans &&
|         !bms_nonempty_difference(asyncplanids, fsplanids))
|         return false;

It is the correct restiction?  I understand that the currently
intending restriction is one connection accepts at most one FDW-scan
node. This looks somethig different...

(Sorry, time's up for now.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, Feb 12, 2021 at 5:30 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> It seems too specific to async Append so I look it as a PoC of the
> mechanism.

Are you saying that the patch only reconsiders async ForeignScans?

> It creates a hash table that keyed by connection umid to record
> planids run on the connection, triggerd by core planner via a dedicate
> API function.  It seems to me that ConnCacheEntry.state can hold that
> and the hash is not needed at all.

I think a good thing about the hash table is that it can be used by
other FDWs that support async execution in a similar way to
postgres_fdw, so they don’t need to create their own hash tables.  But
I’d like to know about the idea of using ConnCacheEntry.  Could you
elaborate a bit more about that?

> | postgresReconsiderAsyncForeignScan(ForeignScanState *node, AsyncContext *acxt)
> | {
> | ...
> |     /*
> |        * If the connection used for the ForeignScan node is used in other parts
> |        * of the query plan tree except async subplans of the parent Append node,
> |        * disable async execution of the ForeignScan node.
> |        */
> |       if (!bms_is_subset(fsplanids, asyncplanids))
> |               return false;
>
> This would be a reasonable restriction.

Cool!

> |       /*
> |        * If the subplans of the Append node are all async-capable, and use the
> |        * same connection, then we won't execute them asynchronously.
> |        */
> |       if (requestor->as_nasyncplans == requestor->as_nplans &&
> |               !bms_nonempty_difference(asyncplanids, fsplanids))
> |               return false;
>
> It is the correct restiction?  I understand that the currently
> intending restriction is one connection accepts at most one FDW-scan
> node. This looks somethig different...

People put multiple partitions in a remote PostgreSQL server in
sharding, so the patch allows multiple postgres_fdw ForeignScans
beneath an Append that use the same connection to be executed
asynchronously like this:

postgres=# create table t1 (a int, b int, c text);
postgres=# create table t2 (a int, b int, c text);
postgres=# create table t3 (a int, b int, c text);
postgres=# create foreign table p1 (a int, b int, c text) server
server1 options (table_name 't1');
postgres=# create foreign table p2 (a int, b int, c text) server
server2 options (table_name 't2');
postgres=# create foreign table p3 (a int, b int, c text) server
server2 options (table_name 't3');
postgres=# create table pt (a int, b int, c text) partition by range (a);
postgres=# alter table pt attach partition p1 for values from (10) to (20);
postgres=# alter table pt attach partition p2 for values from (20) to (30);
postgres=# alter table pt attach partition p3 for values from (30) to (40);
postgres=# insert into p1 select 10 + i % 10, i, to_char(i, 'FM0000')
from generate_series(0, 99) i;
postgres=# insert into p2 select 20 + i % 10, i, to_char(i, 'FM0000')
from generate_series(0, 99) i;
postgres=# insert into p3 select 30 + i % 10, i, to_char(i, 'FM0000')
from generate_series(0, 99) i;
postgres=# analyze pt;

postgres=# explain verbose select count(*) from pt;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Aggregate  (cost=314.25..314.26 rows=1 width=8)
   Output: count(*)
   ->  Append  (cost=100.00..313.50 rows=300 width=0)
         ->  Async Foreign Scan on public.p1 pt_1
(cost=100.00..104.00 rows=100 width=0)
               Remote SQL: SELECT NULL FROM public.t1
         ->  Async Foreign Scan on public.p2 pt_2
(cost=100.00..104.00 rows=100 width=0)
               Remote SQL: SELECT NULL FROM public.t2
         ->  Async Foreign Scan on public.p3 pt_3
(cost=100.00..104.00 rows=100 width=0)
               Remote SQL: SELECT NULL FROM public.t3
(9 rows)

For this query, p2 and p3, which use the same connection, are scanned
asynchronously!

But if all the subplans of an Append are async postgres_fdw
ForeignScans that use the same connection, they won’t be parallelized
at all, and the overhead of async execution may cause a performance
degradation.  So the patch disables async execution of them in that
case using the above code bit.

Thanks for the review!

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Wed, Feb 10, 2021 at 9:31 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Please find attached an updated patch.

I noticed that this doesn’t work for cases where ForeignScans are
executed inside functions, and I don’t have any simple solution for
that.  So I’m getting back to what Horiguchi-san proposed for
postgres_fdw to handle concurrent fetches from a remote server
performed by multiple ForeignScan nodes that use the same connection.
As discussed before, we would need to create a scheduler for
performing such fetches in a more optimized way to avoid a performance
degradation in some cases, but that wouldn’t be easy.  Instead, how
about reducing concurrency as an alternative?  In his proposal,
postgres_fdw was modified to perform prefetching pretty aggressively,
so I mean removing aggressive prefetching.  I think we could add it to
postgres_fdw later maybe as the server/table options.  Sorry for the
back and forth.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
Sorry that I haven't been able to respond.

At Thu, 18 Feb 2021 11:51:59 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in 
> On Wed, Feb 10, 2021 at 9:31 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > Please find attached an updated patch.
> 
> I noticed that this doesn’t work for cases where ForeignScans are
> executed inside functions, and I don’t have any simple solution for

Ah, concurrent fetches in different plan trees?  (For fairness, I
hadn't noticed that case:p) The same can happen when an extension that
is called via hooks.

> that.  So I’m getting back to what Horiguchi-san proposed for
> postgres_fdw to handle concurrent fetches from a remote server
> performed by multiple ForeignScan nodes that use the same connection.
> As discussed before, we would need to create a scheduler for
> performing such fetches in a more optimized way to avoid a performance
> degradation in some cases, but that wouldn’t be easy.  Instead, how

If the "degradation" means degradation caused by repeated creation of
remote cursors, anyway every node on the same connection create its
own connection named as "c<n>" and never "re"created in any case.

If the "degradation" means that my patch needs to wait for the
previous prefetching query to return tuples before sending a new query
(vacate_connection()), it is just moving the wait from just before
sending the new query to just before fetching the next round of the
previous node. The only case it becomes visible degradation is where
the tuples in the next round is not wanted by the upper nodes.

unpatched

nodeA <tuple exhaused>
      <send prefetching FETCH A>
      <return the last tuple of the last round>
nodeB !!<wait for FETCH A returns>
      <send FETCH B>
      !!<wait for FETCH B returns>
      <return tuple just returned>
nodeA <return already fetched tuple>      

patched

nodeA <tuple exhaused>
      <return the last tuple of the last round>
nodeB <send FETCH B>
      !!<wait for FETCH B returns>
      <return the first tuple of the round>
nodeA <send FETCH A>
      !!<wait for FETCH A returns>
      <return the first tuple of the round>

That happens when the upper node stops just after the internal
tuplestore is emptied, and the probability is one in fetch_tuples. (It
is not stochastic so if a query gets suffered by the degradation, it
always suffers unless fetch_tuples is not changed.)  I'm still not
sure that degree of degradaton becomes a show stopper.

> degradation in some cases, but that wouldn’t be easy.  Instead, how
> about reducing concurrency as an alternative?  In his proposal,
> postgres_fdw was modified to perform prefetching pretty aggressively,
> so I mean removing aggressive prefetching.  I think we could add it to
> postgres_fdw later maybe as the server/table options.  Sorry for the
> back and forth.

That was the natural extension from non-aggresive prefetching.
However, maybe we can live without that since if some needs more
speed, it is enought to give every remote tables a dedicate
connection.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Mon, Mar 8, 2021 at 2:05 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> There seems to be no objections, so I went ahead and added the
> table/server option ‘async_capable’ set false by default.  Attached is
> an updated patch.

Attached is an updated version of the patch.  Changes are:

* I modified nodeAppend.c a bit further to make the code simpler
(mostly, ExecAppendAsyncBegin() and related code).
* I added a function ExecAsyncRequestPending() to execAsync.c for the
convenience of FDWs.
* I fixed a bug in the definition of WAIT_EVENT_APPEND_READY in pgstat.h.
* I fixed a bug in process_pending_request() in postgres_fdw.c.
* I added comments to executor/README based on Robert’s original patch.
* I added/adjusted/fixed some other comments and docs.
* I think it would be better to keep the existing test cases in
postgres_fdw.sql as-is for testing the existing features, so I
modified it as such, and added new test cases for testing this
feature.
* I rebased the patch against HEAD.

I haven’t yet added docs on FDW APIs.  I think the patch would need a
bit more comments.  But other than that, I feel the patch is in good
shape.

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, Mar 19, 2021 at 9:57 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> is_async_capable_path() should probably have a "break" for case T_ForeignPath.

Good catch!  Will fix.

> little typos:
> aready
> sigle
> givne
> a event

Lots of typos.  :-(  Will fix.

Thank you for the review!

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, Mar 19, 2021 at 8:48 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I haven’t yet added docs on FDW APIs.  I think the patch would need a
> bit more comments.

Here is an updated patch.  Changes are:

* Added docs on FDW APIs.
* Added/tweaked some more comments.
* Fixed a bug and typos pointed out by Justin.
* Added an assertion to ExecAppendAsyncBegin().
* Added a bit more regression test cases.
* Rebased the patch against HEAD.

I think the patch would be committable.

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Mon, Mar 29, 2021 at 6:50 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I think the patch would be committable.

Here is a new version of the patch.

* Rebased the patch against HEAD.
* Tweaked docs/comments a bit further.
* Added the commit message.  Does that make sense?

I'm happy with the patch, so I'll commit it if there are no objections.

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
At Tue, 30 Mar 2021 20:40:35 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in 
> On Mon, Mar 29, 2021 at 6:50 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > I think the patch would be committable.
> 
> Here is a new version of the patch.
> 
> * Rebased the patch against HEAD.
> * Tweaked docs/comments a bit further.
> * Added the commit message.  Does that make sense?
> 
> I'm happy with the patch, so I'll commit it if there are no objections.

Thanks for the patch.

May I ask some questions?

+     <term><literal>async_capable</literal></term>
+     <listitem>
+      <para>
+       This option controls whether <filename>postgres_fdw</filename> allows
+       foreign tables to be scanned concurrently for asynchronous execution.
+       It can be specified for a foreign table or a foreign server.

Isn't it strange that an option named "async_capable" *allows* async?

+         * We'll prefer to consider this join async-capable if any table from
+         * either side of the join is considered async-capable.
+         */
+        fpinfo->async_capable = fpinfo_o->async_capable ||
+            fpinfo_i->async_capable;

We need to explain this behavior in the documentation.

Regarding to the wording "async capable", if it literally represents
the capability to run asynchronously, when any one element of a
combined path doesn't have the capability, the whole path cannot be
async-capable.  If it represents allowance for an element to run
asynchronously, then the whole path is inhibited to run asynchronously
unless all elements are allowed to do so.  If it represents
enforcement or suggestion to run asynchronously, enforcing asynchrony
to an element would lead to running the whole path asynchronously
since all elements of postgres_fdw are capable to run asynchronously
as the nature.

It looks somewhat inconsistent to be inhibitive for the default value
of "async_capable", but agressive in merging?

If I'm wrong in the understanding, please feel free to go ahead.

regrds.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Wed, Mar 31, 2021 at 10:11 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> +     <term><literal>async_capable</literal></term>
> +     <listitem>
> +      <para>
> +       This option controls whether <filename>postgres_fdw</filename> allows
> +       foreign tables to be scanned concurrently for asynchronous execution.
> +       It can be specified for a foreign table or a foreign server.
>
> Isn't it strange that an option named "async_capable" *allows* async?

I think "async_capable" is a good name for that option.  See the
option "updatable" below in the postgres_fdw documentation.

> +                * We'll prefer to consider this join async-capable if any table from
> +                * either side of the join is considered async-capable.
> +                */
> +               fpinfo->async_capable = fpinfo_o->async_capable ||
> +                       fpinfo_i->async_capable;
>
> We need to explain this behavior in the documentation.
>
> Regarding to the wording "async capable", if it literally represents
> the capability to run asynchronously, when any one element of a
> combined path doesn't have the capability, the whole path cannot be
> async-capable.  If it represents allowance for an element to run
> asynchronously, then the whole path is inhibited to run asynchronously
> unless all elements are allowed to do so.  If it represents
> enforcement or suggestion to run asynchronously, enforcing asynchrony
> to an element would lead to running the whole path asynchronously
> since all elements of postgres_fdw are capable to run asynchronously
> as the nature.
>
> It looks somewhat inconsistent to be inhibitive for the default value
> of "async_capable", but agressive in merging?

If the foreign table has async_capable=true, it actually means that
there are resources (CPU, IO, network, etc.) to scan the foreign table
concurrently.  And if any table from either side of the join has such
resources, then they could also be used for the join.  So I don't
think this behavior is aggressive.  I think it would be better to add
more comments, though.

Anyway, these are all about naming and docs/comments, so I'll return
to this after committing the patch.

Thanks for the review!

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Tue, Mar 30, 2021 at 8:40 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I'm happy with the patch, so I'll commit it if there are no objections.

Pushed.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Tom Lane
Date:
Etsuro Fujita <etsuro.fujita@gmail.com> writes:
> Pushed.

The buildfarm points out that this fails under valgrind.
I easily reproduced it here:

==00:00:03:42.115 3410499== Syscall param epoll_wait(events) points to unaddressable byte(s)
==00:00:03:42.115 3410499==    at 0x58E926B: epoll_wait (epoll_wait.c:30)
==00:00:03:42.115 3410499==    by 0x7FC903: WaitEventSetWaitBlock (latch.c:1452)
==00:00:03:42.115 3410499==    by 0x7FC903: WaitEventSetWait (latch.c:1398)
==00:00:03:42.115 3410499==    by 0x6BF46C: ExecAppendAsyncEventWait (nodeAppend.c:1025)
==00:00:03:42.115 3410499==    by 0x6BF667: ExecAppendAsyncGetNext (nodeAppend.c:915)
==00:00:03:42.115 3410499==    by 0x6BF667: ExecAppend (nodeAppend.c:337)
==00:00:03:42.115 3410499==    by 0x6D49E4: ExecProcNode (executor.h:257)
==00:00:03:42.115 3410499==    by 0x6D49E4: ExecModifyTable (nodeModifyTable.c:2222)
==00:00:03:42.115 3410499==    by 0x6A87F2: ExecProcNode (executor.h:257)
==00:00:03:42.115 3410499==    by 0x6A87F2: ExecutePlan (execMain.c:1531)
==00:00:03:42.115 3410499==    by 0x6A87F2: standard_ExecutorRun (execMain.c:350)
==00:00:03:42.115 3410499==    by 0x82597F: ProcessQuery (pquery.c:160)
==00:00:03:42.115 3410499==    by 0x825BE9: PortalRunMulti (pquery.c:1267)
==00:00:03:42.115 3410499==    by 0x826826: PortalRun (pquery.c:779)
==00:00:03:42.115 3410499==    by 0x82291E: exec_simple_query (postgres.c:1185)
==00:00:03:42.115 3410499==    by 0x823F3E: PostgresMain (postgres.c:4415)
==00:00:03:42.115 3410499==    by 0x79BAC1: BackendRun (postmaster.c:4483)
==00:00:03:42.115 3410499==    by 0x79BAC1: BackendStartup (postmaster.c:4205)
==00:00:03:42.115 3410499==    by 0x79BAC1: ServerLoop (postmaster.c:1737)
==00:00:03:42.115 3410499==  Address 0x10d10628 is 7,960 bytes inside a recently re-allocated block of size 8,192
alloc'd
==00:00:03:42.115 3410499==    at 0x4C30F0B: malloc (vg_replace_malloc.c:307)
==00:00:03:42.115 3410499==    by 0x94F9EA: AllocSetAlloc (aset.c:919)
==00:00:03:42.115 3410499==    by 0x957BAF: MemoryContextAlloc (mcxt.c:809)
==00:00:03:42.115 3410499==    by 0x958CC0: MemoryContextStrdup (mcxt.c:1179)
==00:00:03:42.115 3410499==    by 0x516AE4: untransformRelOptions (reloptions.c:1336)
==00:00:03:42.115 3410499==    by 0x6E6ADF: GetForeignTable (foreign.c:273)
==00:00:03:42.115 3410499==    by 0xF3BD470: postgresBeginForeignScan (postgres_fdw.c:1479)
==00:00:03:42.115 3410499==    by 0x6C2E83: ExecInitForeignScan (nodeForeignscan.c:236)
==00:00:03:42.115 3410499==    by 0x6AF893: ExecInitNode (execProcnode.c:283)
==00:00:03:42.115 3410499==    by 0x6C0007: ExecInitAppend (nodeAppend.c:232)
==00:00:03:42.115 3410499==    by 0x6AFA37: ExecInitNode (execProcnode.c:180)
==00:00:03:42.115 3410499==    by 0x6D533A: ExecInitModifyTable (nodeModifyTable.c:2575)

==00:00:03:44.907 3410499== Syscall param epoll_wait(events) points to unaddressable byte(s)
==00:00:03:44.907 3410499==    at 0x58E926B: epoll_wait (epoll_wait.c:30)
==00:00:03:44.907 3410499==    by 0x7FC903: WaitEventSetWaitBlock (latch.c:1452)
==00:00:03:44.907 3410499==    by 0x7FC903: WaitEventSetWait (latch.c:1398)
==00:00:03:44.907 3410499==    by 0x6BF46C: ExecAppendAsyncEventWait (nodeAppend.c:1025)
==00:00:03:44.907 3410499==    by 0x6BF718: ExecAppend (nodeAppend.c:370)
==00:00:03:44.907 3410499==    by 0x6D49E4: ExecProcNode (executor.h:257)
==00:00:03:44.907 3410499==    by 0x6D49E4: ExecModifyTable (nodeModifyTable.c:2222)
==00:00:03:44.907 3410499==    by 0x6A87F2: ExecProcNode (executor.h:257)
==00:00:03:44.907 3410499==    by 0x6A87F2: ExecutePlan (execMain.c:1531)
==00:00:03:44.907 3410499==    by 0x6A87F2: standard_ExecutorRun (execMain.c:350)
==00:00:03:44.907 3410499==    by 0x82597F: ProcessQuery (pquery.c:160)
==00:00:03:44.907 3410499==    by 0x825BE9: PortalRunMulti (pquery.c:1267)
==00:00:03:44.907 3410499==    by 0x826826: PortalRun (pquery.c:779)
==00:00:03:44.907 3410499==    by 0x82291E: exec_simple_query (postgres.c:1185)
==00:00:03:44.907 3410499==    by 0x823F3E: PostgresMain (postgres.c:4415)
==00:00:03:44.907 3410499==    by 0x79BAC1: BackendRun (postmaster.c:4483)
==00:00:03:44.907 3410499==    by 0x79BAC1: BackendStartup (postmaster.c:4205)
==00:00:03:44.907 3410499==    by 0x79BAC1: ServerLoop (postmaster.c:1737)
==00:00:03:44.907 3410499==  Address 0x1093fdd8 is 2,904 bytes inside a recently re-allocated block of size 16,384
alloc'd
==00:00:03:44.907 3410499==    at 0x4C30F0B: malloc (vg_replace_malloc.c:307)
==00:00:03:44.907 3410499==    by 0x94F9EA: AllocSetAlloc (aset.c:919)
==00:00:03:44.907 3410499==    by 0x958233: palloc (mcxt.c:964)
==00:00:03:44.907 3410499==    by 0x69C400: ExprEvalPushStep (execExpr.c:2310)
==00:00:03:44.907 3410499==    by 0x69C541: ExecPushExprSlots (execExpr.c:2490)
==00:00:03:44.907 3410499==    by 0x69C580: ExecInitExprSlots (execExpr.c:2445)
==00:00:03:44.907 3410499==    by 0x69F0DD: ExecInitQual (execExpr.c:231)
==00:00:03:44.907 3410499==    by 0x6D80EF: ExecInitSeqScan (nodeSeqscan.c:172)
==00:00:03:44.907 3410499==    by 0x6AF9CE: ExecInitNode (execProcnode.c:208)
==00:00:03:44.907 3410499==    by 0x6C0007: ExecInitAppend (nodeAppend.c:232)
==00:00:03:44.907 3410499==    by 0x6AFA37: ExecInitNode (execProcnode.c:180)
==00:00:03:44.907 3410499==    by 0x6D533A: ExecInitModifyTable (nodeModifyTable.c:2575)
==00:00:03:44.907 3410499==

Sorta looks like something is relying on a pointer into the relcache
to be valid for longer than it can safely rely on that.  The
CLOBBER_CACHE_ALWAYS animals will probably be unhappy too, but
they are slower than valgrind.

(Note that the test case appears to succeed, you have to notice that
the backend crashed after exiting.)

            regards, tom lane



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, Apr 2, 2021 at 12:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The buildfarm points out that this fails under valgrind.
> I easily reproduced it here:

> Sorta looks like something is relying on a pointer into the relcache
> to be valid for longer than it can safely rely on that.  The
> CLOBBER_CACHE_ALWAYS animals will probably be unhappy too, but
> they are slower than valgrind.
>
> (Note that the test case appears to succeed, you have to notice that
> the backend crashed after exiting.)

Will look into this.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, Apr 2, 2021 at 12:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The buildfarm points out that this fails under valgrind.
> I easily reproduced it here:
>
> ==00:00:03:42.115 3410499== Syscall param epoll_wait(events) points to unaddressable byte(s)
> ==00:00:03:42.115 3410499==    at 0x58E926B: epoll_wait (epoll_wait.c:30)
> ==00:00:03:42.115 3410499==    by 0x7FC903: WaitEventSetWaitBlock (latch.c:1452)
> ==00:00:03:42.115 3410499==    by 0x7FC903: WaitEventSetWait (latch.c:1398)
> ==00:00:03:42.115 3410499==    by 0x6BF46C: ExecAppendAsyncEventWait (nodeAppend.c:1025)
> ==00:00:03:42.115 3410499==    by 0x6BF667: ExecAppendAsyncGetNext (nodeAppend.c:915)
> ==00:00:03:42.115 3410499==    by 0x6BF667: ExecAppend (nodeAppend.c:337)
> ==00:00:03:42.115 3410499==    by 0x6D49E4: ExecProcNode (executor.h:257)
> ==00:00:03:42.115 3410499==    by 0x6D49E4: ExecModifyTable (nodeModifyTable.c:2222)
> ==00:00:03:42.115 3410499==    by 0x6A87F2: ExecProcNode (executor.h:257)
> ==00:00:03:42.115 3410499==    by 0x6A87F2: ExecutePlan (execMain.c:1531)
> ==00:00:03:42.115 3410499==    by 0x6A87F2: standard_ExecutorRun (execMain.c:350)
> ==00:00:03:42.115 3410499==    by 0x82597F: ProcessQuery (pquery.c:160)
> ==00:00:03:42.115 3410499==    by 0x825BE9: PortalRunMulti (pquery.c:1267)
> ==00:00:03:42.115 3410499==    by 0x826826: PortalRun (pquery.c:779)
> ==00:00:03:42.115 3410499==    by 0x82291E: exec_simple_query (postgres.c:1185)
> ==00:00:03:42.115 3410499==    by 0x823F3E: PostgresMain (postgres.c:4415)
> ==00:00:03:42.115 3410499==    by 0x79BAC1: BackendRun (postmaster.c:4483)
> ==00:00:03:42.115 3410499==    by 0x79BAC1: BackendStartup (postmaster.c:4205)
> ==00:00:03:42.115 3410499==    by 0x79BAC1: ServerLoop (postmaster.c:1737)
> ==00:00:03:42.115 3410499==  Address 0x10d10628 is 7,960 bytes inside a recently re-allocated block of size 8,192
alloc'd
> ==00:00:03:42.115 3410499==    at 0x4C30F0B: malloc (vg_replace_malloc.c:307)
> ==00:00:03:42.115 3410499==    by 0x94F9EA: AllocSetAlloc (aset.c:919)
> ==00:00:03:42.115 3410499==    by 0x957BAF: MemoryContextAlloc (mcxt.c:809)
> ==00:00:03:42.115 3410499==    by 0x958CC0: MemoryContextStrdup (mcxt.c:1179)
> ==00:00:03:42.115 3410499==    by 0x516AE4: untransformRelOptions (reloptions.c:1336)
> ==00:00:03:42.115 3410499==    by 0x6E6ADF: GetForeignTable (foreign.c:273)
> ==00:00:03:42.115 3410499==    by 0xF3BD470: postgresBeginForeignScan (postgres_fdw.c:1479)
> ==00:00:03:42.115 3410499==    by 0x6C2E83: ExecInitForeignScan (nodeForeignscan.c:236)
> ==00:00:03:42.115 3410499==    by 0x6AF893: ExecInitNode (execProcnode.c:283)
> ==00:00:03:42.115 3410499==    by 0x6C0007: ExecInitAppend (nodeAppend.c:232)
> ==00:00:03:42.115 3410499==    by 0x6AFA37: ExecInitNode (execProcnode.c:180)
> ==00:00:03:42.115 3410499==    by 0x6D533A: ExecInitModifyTable (nodeModifyTable.c:2575)

The reason for this would be that epoll_wait() is called with
maxevents exceeding the size of the input event array in the test
case.  To fix, I adjusted the parameters to call the caller function
WaitEventSetWait() with in ExecAppendAsyncEventWait().  Patch
attached.

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
Thanks for the patch.

At Mon, 5 Apr 2021 17:15:47 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in 
> On Fri, Apr 2, 2021 at 12:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > The buildfarm points out that this fails under valgrind.
> > I easily reproduced it here:
> >
> > ==00:00:03:42.115 3410499== Syscall param epoll_wait(events) points to unaddressable byte(s)
> > ==00:00:03:42.115 3410499==    at 0x58E926B: epoll_wait (epoll_wait.c:30)
> > ==00:00:03:42.115 3410499==    by 0x7FC903: WaitEventSetWaitBlock (latch.c:1452)
...
> The reason for this would be that epoll_wait() is called with
> maxevents exceeding the size of the input event array in the test
> case.  To fix, I adjusted the parameters to call the caller function

# s/input/output/ event array? (occurrred_events)

# I couldn't reproduce it, so sorry in advance if the following
# discussion is totally bogus..

I have nothing to say if it actually corrects the error, but the only
restriction of maxevents is that it must be positive, and in any case
epoll_wait returns no more than set->nevents events. So I'm a bit
wondering if that's the reason. In the first place I'm wondering if
valgrind is aware of that depth..

==00:00:03:42.115 3410499== Syscall param epoll_wait(events) points to unaddressable byte(s)
==00:00:03:42.115 3410499==    at 0x58E926B: epoll_wait (epoll_wait.c:30)
...
==00:00:03:42.115 3410499==  Address 0x10d10628 is 7,960 bytes inside a recently re-allocated block of size 8,192
alloc'd
==00:00:03:42.115 3410499==    at 0x4C30F0B: malloc (vg_replace_malloc.c:307)
==00:00:03:42.115 3410499==    by 0x94F9EA: AllocSetAlloc (aset.c:919)
==00:00:03:42.115 3410499==    by 0x957BAF: MemoryContextAlloc (mcxt.c:809)
==00:00:03:42.115 3410499==    by 0x958CC0: MemoryContextStrdup (mcxt.c:1179)
==00:00:03:42.115 3410499==    by 0x516AE4: untransformRelOptions (reloptions.c:1336)
==00:00:03:42.115 3410499==    by 0x6E6ADF: GetForeignTable (foreign.c:273)
==00:00:03:42.115 3410499==    by 0xF3BD470: postgresBeginForeignScan (postgres_fdw.c:1479)

As Tom said, this looks like set->epoll_ret_events at the time points
to a palloc'ed memory resided within a realloced chunk.

Valgrind is saying that the variable (WaitEventSet*) set itself is a
valid pointer.  On the other hand set->epoll_ret_events poinst to a
memory chunk that maybe valgrind thinks to have been freed. Since they
are in one allocation block so the pointer alone is broken if valgrind
is right in its complain.

I'm at a loss. How did you cause the error?

> WaitEventSetWait() with in ExecAppendAsyncEventWait().  Patch
> attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Tue, Apr 6, 2021 at 12:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Mon, 5 Apr 2021 17:15:47 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
> > On Fri, Apr 2, 2021 at 12:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > The buildfarm points out that this fails under valgrind.
> > > I easily reproduced it here:
> > >
> > > ==00:00:03:42.115 3410499== Syscall param epoll_wait(events) points to unaddressable byte(s)
> > > ==00:00:03:42.115 3410499==    at 0x58E926B: epoll_wait (epoll_wait.c:30)
> > > ==00:00:03:42.115 3410499==    by 0x7FC903: WaitEventSetWaitBlock (latch.c:1452)
> ...
> > The reason for this would be that epoll_wait() is called with
> > maxevents exceeding the size of the input event array in the test
> > case.  To fix, I adjusted the parameters to call the caller function
>
> # s/input/output/ event array? (occurrred_events)

Sorry, my explanation was not enough.  I think I was in a hurry.  I
mean by "the input event array" the epoll_event array given to
epoll_wait() (i.e., the epoll_ret_events array).

> # I couldn't reproduce it, so sorry in advance if the following
> # discussion is totally bogus..

I produced this failure by running the following simple query in async
mode on a valgrind-enabled build:

select * from ft1 union all select * from ft2

where ft1 and ft2 are postgres_fdw foreign tables.  For this query, we
would call WaitEventSetWait() with nevents=16 in
ExecAppendAsyncEventWait() as EVENT_BUFFER_SIZE=16, and then
epoll_wait() with maxevents=16 in WaitEventSetWaitBlock(); but
maxevents would exceed the input event array as the array size is
three.  I think this inconsitency would cause the valgrind failure.
I'm not 100% sure about that, but the patch fixing this inconsistency
I posted fixed the failure in my environment.

> I have nothing to say if it actually corrects the error, but the only
> restriction of maxevents is that it must be positive, and in any case
> epoll_wait returns no more than set->nevents events. So I'm a bit
> wondering if that's the reason. In the first place I'm wondering if
> valgrind is aware of that depth..

Yeah, the failure might actually be harmless, but anyway, we should
make the buildfarm green.  Also, we should improve the code to avoid
the consistency mentioned above, so I'll apply the patch.

Thanks for the comments!

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Tue, Apr 6, 2021 at 5:45 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Also, we should improve the code to avoid
> the consistency mentioned above,

Sorry, s/consistency/inconsistency/.

> I'll apply the patch.

Done.  Let's see if this works.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Wed, Mar 31, 2021 at 2:12 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Wed, Mar 31, 2021 at 10:11 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > +                * We'll prefer to consider this join async-capable if any table from
> > +                * either side of the join is considered async-capable.
> > +                */
> > +               fpinfo->async_capable = fpinfo_o->async_capable ||
> > +                       fpinfo_i->async_capable;
> >
> > We need to explain this behavior in the documentation.

> > It looks somewhat inconsistent to be inhibitive for the default value
> > of "async_capable", but agressive in merging?
>
> If the foreign table has async_capable=true, it actually means that
> there are resources (CPU, IO, network, etc.) to scan the foreign table
> concurrently.  And if any table from either side of the join has such
> resources, then they could also be used for the join.  So I don't
> think this behavior is aggressive.  I think it would be better to add
> more comments, though.
>
> I'll return to this after committing the patch.

I updated the above comment so that it explains the reason.  Please
find attached a patch.  I did some cleanup as well:

* Simplified code in ExecAppendAsyncEventWait() a little bit to avoid
duplicating the same nevents calculation, and updated comments there.

* Added an assertion to ExecAppendAsyncRequest().

* Updated comments for fetch_more_data_begin().

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Thu, Apr 22, 2021 at 12:30 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

> > > +                * We'll prefer to consider this join async-capable if any table from
> > > +                * either side of the join is considered async-capable.
> > > +                */
> > > +               fpinfo->async_capable = fpinfo_o->async_capable ||
> > > +                       fpinfo_i->async_capable;

> I updated the above comment so that it explains the reason.  Please
> find attached a patch.  I did some cleanup as well:

I have committed the patch.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
"Andrey V. Lepikhov"
Date:
On 4/23/21 8:12 AM, Etsuro Fujita wrote:
> I have committed the patch.
While studying the capabilities of AsyncAppend, I noticed an 
inconsistency with the cost model of the optimizer:

async_capable = off:
--------------------
Append  (cost=100.00..695.00 ...)
    ->  Foreign Scan on f1 part_1  (cost=100.00..213.31 ...)
    ->  Foreign Scan on f2 part_2  (cost=100.00..216.07 ...)
    ->  Foreign Scan on f3 part_3  (cost=100.00..215.62 ...)

async_capable = on:
-------------------
Append  (cost=100.00..695.00 ...)
    ->  Async Foreign Scan on f1 part_1  (cost=100.00..213.31 ...)
    ->  Async Foreign Scan on f2 part_2  (cost=100.00..216.07 ...)
    ->  Async Foreign Scan on f3 part_3  (cost=100.00..215.62 ...)


Here I see two problems:
1. Cost of an AsyncAppend is the same as cost of an Append. But 
execution time of the AsyncAppend for three remote partitions has more 
than halved.
2. Cost of an AsyncAppend looks as a sum of the child ForeignScan costs.

I haven't ideas why it may be a problem right now. But I can imagine 
that it may be a problem in future if we have alternative paths: complex 
pushdown in synchronous mode (a few rows to return) or simple 
asynchronous append with a large set of rows to return.

-- 
regards,
Andrey Lepikhov
Postgres Professional



Re: Asynchronous Append on postgres_fdw nodes.

From
"Andrey V. Lepikhov"
Date:
On 4/23/21 8:12 AM, Etsuro Fujita wrote:
> I have committed the patch.
Small mistake i found. If no tuple was received from a foreign 
partition, explain shows that we never executed node. For example,
if we have 0 tuples in f1 and 100 tuples in f2:

Query:
EXPLAIN (ANALYZE, VERBOSE, TIMING OFF, COSTS OFF)
SELECT * FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1
LIMIT 101;

Explain:
  Limit (actual rows=100 loops=1)
    Output: f1.a
    ->  Append (actual rows=100 loops=1)
          ->  Async Foreign Scan on public.f1 (never executed)
                Output: f1.a
                Remote SQL: SELECT a FROM public.l1
          ->  Async Foreign Scan on public.f2 (actual rows=100 loops=1)
                Output: f2.a
                Remote SQL: SELECT a FROM public.l2

The patch in the attachment fixes this.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
"Andrey V. Lepikhov"
Date:
On 4/23/21 8:12 AM, Etsuro Fujita wrote:
> On Thu, Apr 22, 2021 at 12:30 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I have committed the patch.

One more question. Append choose async plans at the stage of the Append 
plan creation.
Later, the planner performs some optimizations, such as eliminating 
trivial Subquery nodes. So, AsyncAppend is impossible in some 
situations, for example:

(SELECT * FROM f1 WHERE a < 10)
   UNION ALL
(SELECT * FROM f2 WHERE a < 10);

But works for the query:

SELECT *
   FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1
WHERE a < 10;

As far as I understand, this is not a hard limit. We can choose async 
subplans at the beginning of the execution stage.
For a demo, I prepared the patch (see in attachment).
It solves the problem and passes the regression tests.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Mon, Apr 26, 2021 at 3:01 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> While studying the capabilities of AsyncAppend, I noticed an
> inconsistency with the cost model of the optimizer:

> Here I see two problems:
> 1. Cost of an AsyncAppend is the same as cost of an Append. But
> execution time of the AsyncAppend for three remote partitions has more
> than halved.
> 2. Cost of an AsyncAppend looks as a sum of the child ForeignScan costs.

Yeah, we don’t adjust the cost for async Append; it’s the same as that
for sync Append.  But I don’t see any issue as-is, either.  (It’s not
that easy to adjust the cost to an appropriate value in the case of
postgres_fdw, because in that case the cost would vary depending on
which connections are used for scanning foreign tables [1].)

> I haven't ideas why it may be a problem right now. But I can imagine
> that it may be a problem in future if we have alternative paths: complex
> pushdown in synchronous mode (a few rows to return) or simple
> asynchronous append with a large set of rows to return.

Yeah, I think it’s better if we could consider async append paths and
estimate the costs for them accurately at path-creation time, not
plan-creation time, because that would make it possible to use async
execution in more cases, as you pointed out.  But I left that for
future work, because I wanted to make the first cut simple.

Thanks for the review!

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK15i-OyCesd369P8zyBErjN_T18zVYu27714bf_L%3DCOXew%40mail.gmail.com



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Mon, Apr 26, 2021 at 7:35 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> Small mistake i found. If no tuple was received from a foreign
> partition, explain shows that we never executed node. For example,
> if we have 0 tuples in f1 and 100 tuples in f2:
>
> Query:
> EXPLAIN (ANALYZE, VERBOSE, TIMING OFF, COSTS OFF)
> SELECT * FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1
> LIMIT 101;
>
> Explain:
>   Limit (actual rows=100 loops=1)
>     Output: f1.a
>     ->  Append (actual rows=100 loops=1)
>           ->  Async Foreign Scan on public.f1 (never executed)
>                 Output: f1.a
>                 Remote SQL: SELECT a FROM public.l1
>           ->  Async Foreign Scan on public.f2 (actual rows=100 loops=1)
>                 Output: f2.a
>                 Remote SQL: SELECT a FROM public.l2
>
> The patch in the attachment fixes this.

Thanks for the report and patch!  Will look into this.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Thu, Mar 4, 2021 at 1:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Another thing I'm concerned about in the postgres_fdw part is the case
> where all/many postgres_fdw ForeignScans of an Append use the same
> connection, because in that case those ForeignScans are executed one
> by one, not in parallel, and hence the overhead of async execution
> (i.e., doing ExecAppendAsyncEventWait()) would merely cause a
> performance degradation.  Here is such an example:
>
> postgres=# create server loopback foreign data wrapper postgres_fdw
> options (dbname 'postgres');
> postgres=# create user mapping for current_user server loopback;
> postgres=# create table pt (a int, b int, c text) partition by range (a);
> postgres=# create table loct1 (a int, b int, c text);
> postgres=# create table loct2 (a int, b int, c text);
> postgres=# create table loct3 (a int, b int, c text);
> postgres=# create foreign table p1 partition of pt for values from
> (10) to (20) server loopback options (table_name 'loct1');
> postgres=# create foreign table p2 partition of pt for values from
> (20) to (30) server loopback options (table_name 'loct2');
> postgres=# create foreign table p3 partition of pt for values from
> (30) to (40) server loopback options (table_name 'loct3');
> postgres=# insert into p1 select 10 + i % 10, i, to_char(i, 'FM00000')
> from generate_series(0, 99999) i;
> postgres=# insert into p2 select 20 + i % 10, i, to_char(i, 'FM00000')
> from generate_series(0, 99999) i;
> postgres=# insert into p3 select 30 + i % 10, i, to_char(i, 'FM00000')
> from generate_series(0, 99999) i;
> postgres=# analyze pt;
>
> postgres=# set enable_async_append to off;
> postgres=# select count(*) from pt;
>  count
> --------
>  300000
> (1 row)
>
> Time: 366.905 ms
>
> postgres=# set enable_async_append to on;
> postgres=# select count(*) from pt;
>  count
> --------
>  300000
> (1 row)
>
> Time: 385.431 ms

I think the user should be careful about this.  How about adding a
note about it to the “Asynchronous Execution Options” section in
postgres-fdw.sgml, like the attached?

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Tue, Apr 27, 2021 at 9:31 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Mon, Apr 26, 2021 at 7:35 PM Andrey V. Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
> > Small mistake i found. If no tuple was received from a foreign
> > partition, explain shows that we never executed node.

> > The patch in the attachment fixes this.
>
> Will look into this.

The patch fixes the issue, but I don’t think it’s the right way to go,
because it requires an extra ExecProcNode() call, which wouldn’t be
efficient.  Also, the patch wouldn’t address another issue I noticed
in EXPLAIN ANALYZE for async-capable nodes that the command wouldn’t
measure the time spent in such nodes accurately.  For the case of
async-capable node using postgres_fdw, it only measures the time spent
in ExecProcNode() in ExecAsyncRequest()/ExecAsyncNotify(), missing the
time spent in other things such as creating a cursor in
ExecAsyncRequest().  :-(.  To address both issues, I’d like to propose
the attached, in which I added instrumentation support to
ExecAsyncRequest()/ExecAsyncConfigureWait()/ExecAsyncNotify().  I
think this would not only address the reported issue more efficiently,
but allow to collect timing for async-capable nodes more accurately.

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Tue, Apr 27, 2021 at 3:57 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> One more question. Append choose async plans at the stage of the Append
> plan creation.
> Later, the planner performs some optimizations, such as eliminating
> trivial Subquery nodes. So, AsyncAppend is impossible in some
> situations, for example:
>
> (SELECT * FROM f1 WHERE a < 10)
>    UNION ALL
> (SELECT * FROM f2 WHERE a < 10);
>
> But works for the query:
>
> SELECT *
>    FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1
> WHERE a < 10;
>
> As far as I understand, this is not a hard limit.

I think so, but IMO I think this would be an improvement rather than a bug fix.

> We can choose async
> subplans at the beginning of the execution stage.
> For a demo, I prepared the patch (see in attachment).
> It solves the problem and passes the regression tests.

Thanks for the patch!  IIUC, another approach to this would be the
patch you proposed before [1].  Right?

I didn't have time to look at the patch in [1] for PG14.  My apologies
for that.  Actually, I was planning to return it when the development
for PG15 starts.

Sorry for the late reply.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/7fe10f95-ac6c-c81d-a9d3-227493eb9055%40postgrespro.ru



Re: Asynchronous Append on postgres_fdw nodes.

From
Stephen Frost
Date:
Greetings,

* Etsuro Fujita (etsuro.fujita@gmail.com) wrote:
> On Thu, Mar 4, 2021 at 1:00 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > Another thing I'm concerned about in the postgres_fdw part is the case
> > where all/many postgres_fdw ForeignScans of an Append use the same
> > connection, because in that case those ForeignScans are executed one
> > by one, not in parallel, and hence the overhead of async execution
> > (i.e., doing ExecAppendAsyncEventWait()) would merely cause a
> > performance degradation.  Here is such an example:
> >
> > postgres=# create server loopback foreign data wrapper postgres_fdw
> > options (dbname 'postgres');
> > postgres=# create user mapping for current_user server loopback;
> > postgres=# create table pt (a int, b int, c text) partition by range (a);
> > postgres=# create table loct1 (a int, b int, c text);
> > postgres=# create table loct2 (a int, b int, c text);
> > postgres=# create table loct3 (a int, b int, c text);
> > postgres=# create foreign table p1 partition of pt for values from
> > (10) to (20) server loopback options (table_name 'loct1');
> > postgres=# create foreign table p2 partition of pt for values from
> > (20) to (30) server loopback options (table_name 'loct2');
> > postgres=# create foreign table p3 partition of pt for values from
> > (30) to (40) server loopback options (table_name 'loct3');
> > postgres=# insert into p1 select 10 + i % 10, i, to_char(i, 'FM00000')
> > from generate_series(0, 99999) i;
> > postgres=# insert into p2 select 20 + i % 10, i, to_char(i, 'FM00000')
> > from generate_series(0, 99999) i;
> > postgres=# insert into p3 select 30 + i % 10, i, to_char(i, 'FM00000')
> > from generate_series(0, 99999) i;
> > postgres=# analyze pt;
> >
> > postgres=# set enable_async_append to off;
> > postgres=# select count(*) from pt;
> >  count
> > --------
> >  300000
> > (1 row)
> >
> > Time: 366.905 ms
> >
> > postgres=# set enable_async_append to on;
> > postgres=# select count(*) from pt;
> >  count
> > --------
> >  300000
> > (1 row)
> >
> > Time: 385.431 ms
>
> I think the user should be careful about this.  How about adding a
> note about it to the “Asynchronous Execution Options” section in
> postgres-fdw.sgml, like the attached?

I'd suggest the language point out that it's not actually possible to do
otherwise, since they all need to be part of the same transaction.

Without that, it looks like we're just missing a trick somewhere and
someone might think that they could improve PG to open multiple
connections to the same remote server to execute them in parallel.

Maybe:

In order to ensure that the data being returned from a foreign server
is consistent, postgres_fdw will only open one connection for a given
foreign server and will run all queries against that server sequentially
even if there are multiple foreign tables involved.  In such a case, it
may be more performant to disable this option to eliminate the overhead
associated with running queries asynchronously.

... then again, it'd really be better if we could figure out a way to
just do the right thing here.  I haven't looked at this in depth but I
would think that the overhead of async would be well worth it just about
any time there's more than one foreign server involved.  Is it not
reasonable to have a heuristic where we disable async in the cases where
there's only one foreign server, but have it enabled all the other time?
While continuing to allow users to manage it explicitly if they want.

Thanks,

Stephen

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Andrey Lepikhov
Date:
On 6/5/21 22:12, Stephen Frost wrote:
> * Etsuro Fujita (etsuro.fujita@gmail.com) wrote:
>> I think the user should be careful about this.  How about adding a
>> note about it to the “Asynchronous Execution Options” section in
>> postgres-fdw.sgml, like the attached?
+1
> ... then again, it'd really be better if we could figure out a way to
> just do the right thing here.  I haven't looked at this in depth but I
> would think that the overhead of async would be well worth it just about
> any time there's more than one foreign server involved.  Is it not
> reasonable to have a heuristic where we disable async in the cases where
> there's only one foreign server, but have it enabled all the other time?
> While continuing to allow users to manage it explicitly if they want.
Bechmarking of SELECT from foreign partitions hosted on the same server, 
i see results:

With async append:
1 partition - 178 ms; 4 - 263; 8 - 450; 16 - 860; 32 - 1740.

Without:
1 - 178 ms; 4 - 583; 8 - 1140; 16 - 2302; 32 - 4620.

So, these results show that we have a reason to use async append in the 
case where there's only one foreign server.

-- 
regards,
Andrey Lepikhov
Postgres Professional



Re: Asynchronous Append on postgres_fdw nodes.

From
Andrey Lepikhov
Date:
On 6/5/21 11:45, Etsuro Fujita wrote:
> On Tue, Apr 27, 2021 at 9:31 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> The patch fixes the issue, but I don’t think it’s the right way to go,
> because it requires an extra ExecProcNode() call, which wouldn’t be
> efficient.  Also, the patch wouldn’t address another issue I noticed
> in EXPLAIN ANALYZE for async-capable nodes that the command wouldn’t
> measure the time spent in such nodes accurately.  For the case of
> async-capable node using postgres_fdw, it only measures the time spent
> in ExecProcNode() in ExecAsyncRequest()/ExecAsyncNotify(), missing the
> time spent in other things such as creating a cursor in
> ExecAsyncRequest().  :-(.  To address both issues, I’d like to propose
> the attached, in which I added instrumentation support to
> ExecAsyncRequest()/ExecAsyncConfigureWait()/ExecAsyncNotify().  I
> think this would not only address the reported issue more efficiently,
> but allow to collect timing for async-capable nodes more accurately.

Ok, I agree with the approach, but the next test case failed:

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM (
    (SELECT * FROM f1) UNION ALL (SELECT * FROM f2)
) q1 LIMIT 100;
ERROR:  InstrUpdateTupleCount called on node not yet executed

Initialization script see in attachment.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Andrey Lepikhov
Date:
On 6/5/21 14:11, Etsuro Fujita wrote:
> On Tue, Apr 27, 2021 at 3:57 PM Andrey V. Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> One more question. Append choose async plans at the stage of the Append
>> plan creation.
>> Later, the planner performs some optimizations, such as eliminating
>> trivial Subquery nodes. So, AsyncAppend is impossible in some
>> situations, for example:
>>
>> (SELECT * FROM f1 WHERE a < 10)
>>     UNION ALL
>> (SELECT * FROM f2 WHERE a < 10);
>>
>> But works for the query:
>>
>> SELECT *
>>     FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1
>> WHERE a < 10;
>>
>> As far as I understand, this is not a hard limit.
> 
> I think so, but IMO I think this would be an improvement rather than a bug fix.
> 
>> We can choose async
>> subplans at the beginning of the execution stage.
>> For a demo, I prepared the patch (see in attachment).
>> It solves the problem and passes the regression tests.
> 
> Thanks for the patch!  IIUC, another approach to this would be the
> patch you proposed before [1].  Right?
Yes. I think, new solution will be better.

-- 
regards,
Andrey Lepikhov
Postgres Professional



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, May 7, 2021 at 2:12 AM Stephen Frost <sfrost@snowman.net> wrote:
> I'd suggest the language point out that it's not actually possible to do
> otherwise, since they all need to be part of the same transaction.
>
> Without that, it looks like we're just missing a trick somewhere and
> someone might think that they could improve PG to open multiple
> connections to the same remote server to execute them in parallel.

Agreed.

> Maybe:
>
> In order to ensure that the data being returned from a foreign server
> is consistent, postgres_fdw will only open one connection for a given
> foreign server and will run all queries against that server sequentially
> even if there are multiple foreign tables involved.  In such a case, it
> may be more performant to disable this option to eliminate the overhead
> associated with running queries asynchronously.

Ok, I’ll merge this into the next version.

Thanks!

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, May 7, 2021 at 7:35 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> On 6/5/21 14:11, Etsuro Fujita wrote:
> > On Tue, Apr 27, 2021 at 3:57 PM Andrey V. Lepikhov
> > <a.lepikhov@postgrespro.ru> wrote:
> >> One more question. Append choose async plans at the stage of the Append
> >> plan creation.
> >> Later, the planner performs some optimizations, such as eliminating
> >> trivial Subquery nodes. So, AsyncAppend is impossible in some
> >> situations, for example:
> >>
> >> (SELECT * FROM f1 WHERE a < 10)
> >>     UNION ALL
> >> (SELECT * FROM f2 WHERE a < 10);

> >> We can choose async
> >> subplans at the beginning of the execution stage.
> >> For a demo, I prepared the patch (see in attachment).
> >> It solves the problem and passes the regression tests.
> >
> > IIUC, another approach to this would be the
> > patch you proposed before [1].  Right?
> Yes. I think, new solution will be better.

Ok, will review.

I think it would be better to start a new thread for this, and add the
patch to the next CF so that it doesn’t get lost.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, May 7, 2021 at 7:32 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> Ok, I agree with the approach, but the next test case failed:
>
> EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
> SELECT * FROM (
>         (SELECT * FROM f1) UNION ALL (SELECT * FROM f2)
> ) q1 LIMIT 100;
> ERROR:  InstrUpdateTupleCount called on node not yet executed
>
> Initialization script see in attachment.

Reproduced.  Here is the EXPLAIN output for the query:

explain verbose select * from ((select * from f1) union all (select *
from f2)) q1 limit 100;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Limit  (cost=100.00..104.70 rows=100 width=4)
   Output: f1.a
   ->  Append  (cost=100.00..724.22 rows=13292 width=4)
         ->  Async Foreign Scan on public.f1  (cost=100.00..325.62
rows=6554 width=4)
               Output: f1.a
               Remote SQL: SELECT a FROM public.l1
         ->  Async Foreign Scan on public.f2  (cost=100.00..332.14
rows=6738 width=4)
               Output: f2.a
               Remote SQL: SELECT a FROM public.l2
(9 rows)

When executing the query “select * from ((select * from f1) union all
(select * from f2)) q1 limit 100” in async mode, the remote queries
for f1 and f2 would be sent to the remote at the same time in the
first ExecAppend().  If the result for the remote query for f1 is
returned first, the local query would be processed using the result,
and the remote query for f2 in progress would be processed during
ExecutorEnd() using process_pending_request() (and vice versa).  But
in the EXPLAIN ANALYZE case, InstrEndLoop() is called *before*
ExecutorEnd(), and it initializes the instr->running flag, so in that
case, when processing the in-progress remote query in
process_pending_request(), we would call InstrUpdateTupleCount() with
the flag unset, causing this error.

I think a simple fix for this would be just remove the check whether
the instr->running flag is set or not in InstrUpdateTupleCount().
Attached is an updated patch, in which I also updated a comment in
execnodes.h and docs in fdwhandler.sgml to match the code in
nodeAppend.c, and fixed typos in comments in nodeAppend.c.

Thanks for the review and script!

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Andrey Lepikhov
Date:
On 10/5/21 08:03, Etsuro Fujita wrote:
> On Fri, May 7, 2021 at 7:32 PM Andrey Lepikhov
> I think a simple fix for this would be just remove the check whether
> the instr->running flag is set or not in InstrUpdateTupleCount().
> Attached is an updated patch, in which I also updated a comment in
> execnodes.h and docs in fdwhandler.sgml to match the code in
> nodeAppend.c, and fixed typos in comments in nodeAppend.c.
Your patch fixes the problem. But I found two more problems:

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) 
 
 
SELECT * FROM ( 
 
 
               (SELECT * FROM f1) 
 
 
                              UNION ALL 
 
 
                                     (SELECT * FROM f2) 
 
 
                                                    UNION ALL 
 
 
                                                           (SELECT * 
FROM l3) 
 
                                                                    ) q1 
LIMIT 6709;
                           QUERY PLAN
--------------------------------------------------------------
  Limit (actual rows=6709 loops=1)
    ->  Append (actual rows=6709 loops=1)
          ->  Async Foreign Scan on f1 (actual rows=1 loops=1)
          ->  Async Foreign Scan on f2 (actual rows=1 loops=1)
          ->  Seq Scan on l3 (actual rows=6708 loops=1)

Here we scan 6710 tuples at low level but appended only 6709. Where did 
we lose one tuple?

2.
SELECT * FROM (
    (SELECT * FROM f1) 
 
 
                UNION ALL
    (SELECT * FROM f2) 
 
 
                UNION ALL
    (SELECT * FROM f3 WHERE a > 0)
) q1 LIMIT 3000;
                           QUERY PLAN
--------------------------------------------------------------
  Limit (actual rows=3000 loops=1)
    ->  Append (actual rows=3000 loops=1)
          ->  Async Foreign Scan on f1 (actual rows=0 loops=1)
          ->  Async Foreign Scan on f2 (actual rows=0 loops=1)
          ->  Foreign Scan on f3 (actual rows=3000 loops=1)

Here we give preference to the synchronous scan. Why?

-- 
regards,
Andrey Lepikhov
Postgres Professional



On 7/5/21 21:05, Etsuro Fujita wrote:
> I think it would be better to start a new thread for this, and add the
> patch to the next CF so that it doesn’t get lost.

Current implementation of async append choose asynchronous subplans at 
the phase of an append plan creation. This is safe approach, but we 
loose some optimizations, such of flattening trivial subqueries and 
can't execute some simple queries asynchronously. For example:

EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF, COSTS OFF)
(SELECT * FROM f1 WHERE a < 10) UNION ALL
(SELECT * FROM f2 WHERE a < 10);

But, as I could understand, we can choose these subplans later, at the 
init append phase when all optimizations already passed.
In attachment - implementation of the proposed approach.

Initial script for the example see in the parent thread [1].


[1] 
https://www.postgresql.org/message-id/a38bb206-8340-9528-5ef6-37de2d5cb1a3%40postgrespro.ru


-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment


On Mon, May 10, 2021 at 8:45 PM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:
On 7/5/21 21:05, Etsuro Fujita wrote:
> I think it would be better to start a new thread for this, and add the
> patch to the next CF so that it doesn’t get lost.

Current implementation of async append choose asynchronous subplans at
the phase of an append plan creation. This is safe approach, but we
loose some optimizations, such of flattening trivial subqueries and
can't execute some simple queries asynchronously. For example:

EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF, COSTS OFF)
(SELECT * FROM f1 WHERE a < 10) UNION ALL
(SELECT * FROM f2 WHERE a < 10);

But, as I could understand, we can choose these subplans later, at the
init append phase when all optimizations already passed.
In attachment - implementation of the proposed approach.

Initial script for the example see in the parent thread [1].


[1]
https://www.postgresql.org/message-id/a38bb206-8340-9528-5ef6-37de2d5cb1a3%40postgrespro.ru


--
regards,
Andrey Lepikhov
Postgres Professional
Hi,

+           /* Check to see if subplan can be executed asynchronously */
+           if (subplan->async_capable)
+           {
+               subplan->async_capable = false;

It seems the if statement is not needed: you can directly assign false to  subplan->async_capable.

Cheers
On 11/5/21 08:55, Zhihong Yu wrote:
> +           /* Check to see if subplan can be executed asynchronously */
> +           if (subplan->async_capable)
> +           {
> +               subplan->async_capable = false;
> 
> It seems the if statement is not needed: you can directly assign false 
> to  subplan->async_capable.Thank you, I agree with you.
Close look into the postgres_fdw regression tests show at least one open 
problem with this approach: we need to control situations when only one 
partition doesn't pruned and append isn't exist at all.

-- 
regards,
Andrey Lepikhov
Postgres Professional



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Tue, May 11, 2021 at 11:58 AM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> Your patch fixes the problem. But I found two more problems:
>
> EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
> SELECT * FROM (
>         (SELECT * FROM f1)
>                 UNION ALL
>         (SELECT * FROM f2)
>                 UNION ALL
>         (SELECT * FROM l3)
> ) q1 LIMIT 6709;
>                            QUERY PLAN
> --------------------------------------------------------------
>   Limit (actual rows=6709 loops=1)
>     ->  Append (actual rows=6709 loops=1)
>           ->  Async Foreign Scan on f1 (actual rows=1 loops=1)
>           ->  Async Foreign Scan on f2 (actual rows=1 loops=1)
>           ->  Seq Scan on l3 (actual rows=6708 loops=1)
>
> Here we scan 6710 tuples at low level but appended only 6709. Where did
> we lose one tuple?

The extra tuple, which is from f1 or f2, would have been kept in the
Append node's as_asyncresults, not returned from the Append node to
the Limit node.  The async Foreign Scan nodes would fetch tuples
before the Append node ask the tuples, so the fetched tuples may or
may not be used.

> 2.
> SELECT * FROM (
>         (SELECT * FROM f1)
>                 UNION ALL
>         (SELECT * FROM f2)
>                 UNION ALL
>         (SELECT * FROM f3 WHERE a > 0)
> ) q1 LIMIT 3000;
>                            QUERY PLAN
> --------------------------------------------------------------
>   Limit (actual rows=3000 loops=1)
>     ->  Append (actual rows=3000 loops=1)
>           ->  Async Foreign Scan on f1 (actual rows=0 loops=1)
>           ->  Async Foreign Scan on f2 (actual rows=0 loops=1)
>           ->  Foreign Scan on f3 (actual rows=3000 loops=1)
>
> Here we give preference to the synchronous scan. Why?

This would be expected behavior, and the reason is avoid performance
degradation; you might think it would be better to execute the async
Foreign Scan nodes more aggressively, but it would require
waiting/polling for file descriptor events many times, which is
expensive and might cause performance degradation.  I think there is
room for improvement, though.

Thanks!

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Andrey Lepikhov
Date:
On 11/5/21 12:24, Etsuro Fujita wrote:
> On Tue, May 11, 2021 at 11:58 AM Andrey Lepikhov
> The extra tuple, which is from f1 or f2, would have been kept in the
> Append node's as_asyncresults, not returned from the Append node to
> the Limit node.  The async Foreign Scan nodes would fetch tuples
> before the Append node ask the tuples, so the fetched tuples may or
> may not be used.
Ok.>>      ->  Append (actual rows=3000 loops=1)
>>            ->  Async Foreign Scan on f1 (actual rows=0 loops=1)
>>            ->  Async Foreign Scan on f2 (actual rows=0 loops=1)
>>            ->  Foreign Scan on f3 (actual rows=3000 loops=1)
>>
>> Here we give preference to the synchronous scan. Why?
> 
> This would be expected behavior, and the reason is avoid performance
> degradation; you might think it would be better to execute the async
> Foreign Scan nodes more aggressively, but it would require
> waiting/polling for file descriptor events many times, which is
> expensive and might cause performance degradation.  I think there is
> room for improvement, though.
Yes, I agree with you. Maybe you can add note in documentation on 
async_capable, for example:
"... Synchronous and asynchronous scanning strategies can be mixed by 
optimizer in one scan plan of a partitioned table or an 'UNION ALL' 
command. For performance reasons, synchronous scans executes before the 
first of async scan. ..."

-- 
regards,
Andrey Lepikhov
Postgres Professional



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Tue, May 11, 2021 at 6:27 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> On 11/5/21 12:24, Etsuro Fujita wrote:

> >>      ->  Append (actual rows=3000 loops=1)
> >>            ->  Async Foreign Scan on f1 (actual rows=0 loops=1)
> >>            ->  Async Foreign Scan on f2 (actual rows=0 loops=1)
> >>            ->  Foreign Scan on f3 (actual rows=3000 loops=1)
> >>
> >> Here we give preference to the synchronous scan. Why?
> >
> > This would be expected behavior, and the reason is avoid performance
> > degradation; you might think it would be better to execute the async
> > Foreign Scan nodes more aggressively, but it would require
> > waiting/polling for file descriptor events many times, which is
> > expensive and might cause performance degradation.  I think there is
> > room for improvement, though.
> Yes, I agree with you. Maybe you can add note in documentation on
> async_capable, for example:
> "... Synchronous and asynchronous scanning strategies can be mixed by
> optimizer in one scan plan of a partitioned table or an 'UNION ALL'
> command. For performance reasons, synchronous scans executes before the
> first of async scan. ..."

+1  But I think this is an independent issue, so I think it would be
better to address the issue separately.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Tue, May 11, 2021 at 6:55 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Tue, May 11, 2021 at 6:27 PM Andrey Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
> > On 11/5/21 12:24, Etsuro Fujita wrote:
>
> > >>      ->  Append (actual rows=3000 loops=1)
> > >>            ->  Async Foreign Scan on f1 (actual rows=0 loops=1)
> > >>            ->  Async Foreign Scan on f2 (actual rows=0 loops=1)
> > >>            ->  Foreign Scan on f3 (actual rows=3000 loops=1)
> > >>
> > >> Here we give preference to the synchronous scan. Why?
> > >
> > > This would be expected behavior, and the reason is avoid performance
> > > degradation; you might think it would be better to execute the async
> > > Foreign Scan nodes more aggressively, but it would require
> > > waiting/polling for file descriptor events many times, which is
> > > expensive and might cause performance degradation.  I think there is
> > > room for improvement, though.
> > Yes, I agree with you. Maybe you can add note in documentation on
> > async_capable, for example:
> > "... Synchronous and asynchronous scanning strategies can be mixed by
> > optimizer in one scan plan of a partitioned table or an 'UNION ALL'
> > command. For performance reasons, synchronous scans executes before the
> > first of async scan. ..."
>
> +1  But I think this is an independent issue, so I think it would be
> better to address the issue separately.

I have committed the patch for the original issue.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
I'm resending this because I failed to reply to all.

On Sat, May 8, 2021 at 12:55 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Fri, May 7, 2021 at 2:12 AM Stephen Frost <sfrost@snowman.net> wrote:
> > In order to ensure that the data being returned from a foreign server
> > is consistent, postgres_fdw will only open one connection for a given
> > foreign server and will run all queries against that server sequentially
> > even if there are multiple foreign tables involved.  In such a case, it
> > may be more performant to disable this option to eliminate the overhead
> > associated with running queries asynchronously.
>
> Ok, I’ll merge this into the next version.

Stephen’s version would be much better than mine, so I updated the
patch as proposed except the first sentence.  If the foreign tables
are subject to different user mappings, multiple connections will be
opened, and queries will be performed in parallel.  So I expanded the
sentence a little bit, to avoid misunderstanding.  Attached is a new
version.

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Sun, May 16, 2021 at 11:39 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Attached is a new version.

I have committed the patch.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Wed, Mar 31, 2021 at 6:55 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Tue, Mar 30, 2021 at 8:40 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > I'm happy with the patch, so I'll commit it if there are no objections.
>
> Pushed.

I noticed that rescan of async Appends is broken when
do_exec_prune=false, leading to incorrect results on normal builds and
the following failure on assertion-enabled builds:

TRAP: FailedAssertion("node->as_valid_asyncplans == NULL", File:
"nodeAppend.c", Line: 1126, PID: 76644)

See a test case for this added in the attached.  The root cause would
be that we call classify_matching_subplans() to re-determine
sync/async subplans when called from the first ExecAppend() after the
first ReScan, even if do_exec_prune=false, which is incorrect because
in that case it is assumed to re-use sync/async subplans determined
during the the first ExecAppend() after Init.  The attached fixes this
issue.  (A previous patch also had this issue, so I fixed it, but I
think I broke this again when simplifying the patch :-(.)  I did a bit
of cleanup, and modified ExecReScanAppend() to initialize an async
state variable as_nasyncresults to zero, to be sure.  I think the
variable would have been set to zero before we get to that function,
so I don't think we really need to do so, though.

I will add this to the open items list for v14.

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Kyotaro Horiguchi
Date:
At Fri, 28 May 2021 16:30:29 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in 
> On Wed, Mar 31, 2021 at 6:55 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > On Tue, Mar 30, 2021 at 8:40 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > > I'm happy with the patch, so I'll commit it if there are no objections.
> >
> > Pushed.
> 
> I noticed that rescan of async Appends is broken when
> do_exec_prune=false, leading to incorrect results on normal builds and
> the following failure on assertion-enabled builds:
> 
> TRAP: FailedAssertion("node->as_valid_asyncplans == NULL", File:
> "nodeAppend.c", Line: 1126, PID: 76644)
> 
> See a test case for this added in the attached.  The root cause would
> be that we call classify_matching_subplans() to re-determine
> sync/async subplans when called from the first ExecAppend() after the
> first ReScan, even if do_exec_prune=false, which is incorrect because
> in that case it is assumed to re-use sync/async subplans determined
> during the the first ExecAppend() after Init.  The attached fixes this
> issue.  (A previous patch also had this issue, so I fixed it, but I
> think I broke this again when simplifying the patch :-(.)  I did a bit
> of cleanup, and modified ExecReScanAppend() to initialize an async
> state variable as_nasyncresults to zero, to be sure.  I think the
> variable would have been set to zero before we get to that function,
> so I don't think we really need to do so, though.
> 
> I will add this to the open items list for v14.

The patch drops some "= NULL" (initial) initializations when
nasyncplans == 0. AFAICS makeNode() fills the returned memory with
zeroes but I'm not sure it is our convention to omit the
intializations.

Otherwise the patch seems to make the code around cleaner.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
Horiguchi-san,

On Fri, May 28, 2021 at 5:29 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Fri, 28 May 2021 16:30:29 +0900, Etsuro Fujita <etsuro.fujita@gmail.com> wrote in
> > The root cause would
> > be that we call classify_matching_subplans() to re-determine
> > sync/async subplans when called from the first ExecAppend() after the
> > first ReScan, even if do_exec_prune=false, which is incorrect because
> > in that case it is assumed to re-use sync/async subplans determined
> > during the the first ExecAppend() after Init.

I noticed I wrote it wrong.  If do_exec_prune=false, we would
determine sync/async subplans during ExecInitAppend(), so the “re-use
sync/async subplans determined during the the first ExecAppend() after
Init" part should be corrected as “re-use sync/async subplans
determined during ExecInitAppend()”.  Sorry for that.

> The patch drops some "= NULL" (initial) initializations when
> nasyncplans == 0. AFAICS makeNode() fills the returned memory with
> zeroes but I'm not sure it is our convention to omit the
> intializations.

I’m not sure, but I think we omit it in some cases; for example, we
don’t set as_valid_subplans to NULL explicitly in ExecInitAppend(), if
do_exec_prune=true.

> Otherwise the patch seems to make the code around cleaner.

Thanks for reviewing!

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, May 28, 2021 at 10:53 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Fri, May 28, 2021 at 5:29 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > The patch drops some "= NULL" (initial) initializations when
> > nasyncplans == 0. AFAICS makeNode() fills the returned memory with
> > zeroes but I'm not sure it is our convention to omit the
> > intializations.
>
> I’m not sure, but I think we omit it in some cases; for example, we
> don’t set as_valid_subplans to NULL explicitly in ExecInitAppend(), if
> do_exec_prune=true.

Ok, I think it would be a good thing to initialize the
pointers/variables to NULL/zero explicitly, so I updated the patch as
such.  Barring objections, I'll get the patch committed in a few days.

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Tue, May 11, 2021 at 6:55 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Tue, May 11, 2021 at 6:27 PM Andrey Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
> > On 11/5/21 12:24, Etsuro Fujita wrote:
>
> > >>      ->  Append (actual rows=3000 loops=1)
> > >>            ->  Async Foreign Scan on f1 (actual rows=0 loops=1)
> > >>            ->  Async Foreign Scan on f2 (actual rows=0 loops=1)
> > >>            ->  Foreign Scan on f3 (actual rows=3000 loops=1)
> > >>
> > >> Here we give preference to the synchronous scan. Why?
> > >
> > > This would be expected behavior, and the reason is avoid performance
> > > degradation; you might think it would be better to execute the async
> > > Foreign Scan nodes more aggressively, but it would require
> > > waiting/polling for file descriptor events many times, which is
> > > expensive and might cause performance degradation.  I think there is
> > > room for improvement, though.
> > Yes, I agree with you. Maybe you can add note in documentation on
> > async_capable, for example:
> > "... Synchronous and asynchronous scanning strategies can be mixed by
> > optimizer in one scan plan of a partitioned table or an 'UNION ALL'
> > command. For performance reasons, synchronous scans executes before the
> > first of async scan. ..."
>
> +1  But I think this is an independent issue, so I think it would be
> better to address the issue separately.

I think that since postgres-fdw.sgml would be for users rather than
developers, unlike fdwhandler.sgml, it would be better to explain this
more in a not-too-technical way.  So how about something like this?

Asynchronous execution is applied even when an Append node contains
subplan(s) executed synchronously as well as subplan(s) executed
asynchronously.  In that case, if the asynchronous subplans are ones
executed using postgres_fdw, tuples from the asynchronous subplans are
not returned until after at least one synchronous subplan returns all
tuples, as that subplan is executed while the asynchronous subplans
are waiting for the results of queries sent to foreign servers.  This
behavior might change in a future release.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Andrey Lepikhov
Date:
On 3/6/21 14:49, Etsuro Fujita wrote:
> On Tue, May 11, 2021 at 6:55 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>> On Tue, May 11, 2021 at 6:27 PM Andrey Lepikhov
>> <a.lepikhov@postgrespro.ru> wrote:
>>> On 11/5/21 12:24, Etsuro Fujita wrote:
>>
>>>>>       ->  Append (actual rows=3000 loops=1)
>>>>>             ->  Async Foreign Scan on f1 (actual rows=0 loops=1)
>>>>>             ->  Async Foreign Scan on f2 (actual rows=0 loops=1)
>>>>>             ->  Foreign Scan on f3 (actual rows=3000 loops=1)
>>>>>
>>>>> Here we give preference to the synchronous scan. Why?
>>>>
>>>> This would be expected behavior, and the reason is avoid performance
>>>> degradation; you might think it would be better to execute the async
>>>> Foreign Scan nodes more aggressively, but it would require
>>>> waiting/polling for file descriptor events many times, which is
>>>> expensive and might cause performance degradation.  I think there is
>>>> room for improvement, though.
>>> Yes, I agree with you. Maybe you can add note in documentation on
>>> async_capable, for example:
>>> "... Synchronous and asynchronous scanning strategies can be mixed by
>>> optimizer in one scan plan of a partitioned table or an 'UNION ALL'
>>> command. For performance reasons, synchronous scans executes before the
>>> first of async scan. ..."
>>
>> +1  But I think this is an independent issue, so I think it would be
>> better to address the issue separately.
> 
> I think that since postgres-fdw.sgml would be for users rather than
> developers, unlike fdwhandler.sgml, it would be better to explain this
> more in a not-too-technical way.  So how about something like this?
> 
> Asynchronous execution is applied even when an Append node contains
> subplan(s) executed synchronously as well as subplan(s) executed
> asynchronously.  In that case, if the asynchronous subplans are ones
> executed using postgres_fdw, tuples from the asynchronous subplans are
> not returned until after at least one synchronous subplan returns all
> tuples, as that subplan is executed while the asynchronous subplans
> are waiting for the results of queries sent to foreign servers.  This
> behavior might change in a future release.
Good, this text is clear for me.

-- 
regards,
Andrey Lepikhov
Postgres Professional



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Tue, Jun 1, 2021 at 6:30 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Fri, May 28, 2021 at 10:53 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > On Fri, May 28, 2021 at 5:29 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > The patch drops some "= NULL" (initial) initializations when
> > > nasyncplans == 0. AFAICS makeNode() fills the returned memory with
> > > zeroes but I'm not sure it is our convention to omit the
> > > intializations.
> >
> > I’m not sure, but I think we omit it in some cases; for example, we
> > don’t set as_valid_subplans to NULL explicitly in ExecInitAppend(), if
> > do_exec_prune=true.
>
> Ok, I think it would be a good thing to initialize the
> pointers/variables to NULL/zero explicitly, so I updated the patch as
> such.  Barring objections, I'll get the patch committed in a few days.

I'm replanning to push this early next week for some reason.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, Jun 4, 2021 at 7:26 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Tue, Jun 1, 2021 at 6:30 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > On Fri, May 28, 2021 at 10:53 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > > On Fri, May 28, 2021 at 5:29 PM Kyotaro Horiguchi
> > > <horikyota.ntt@gmail.com> wrote:
> > > > The patch drops some "= NULL" (initial) initializations when
> > > > nasyncplans == 0. AFAICS makeNode() fills the returned memory with
> > > > zeroes but I'm not sure it is our convention to omit the
> > > > intializations.
> > >
> > > I’m not sure, but I think we omit it in some cases; for example, we
> > > don’t set as_valid_subplans to NULL explicitly in ExecInitAppend(), if
> > > do_exec_prune=true.
> >
> > Ok, I think it would be a good thing to initialize the
> > pointers/variables to NULL/zero explicitly, so I updated the patch as
> > such.  Barring objections, I'll get the patch committed in a few days.
>
> I'm replanning to push this early next week for some reason.

Pushed.  I will close this in the open items list for v14.

Best regards,
Etsuro Fujita



Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Fri, Jun 4, 2021 at 12:33 AM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> Good, this text is clear for me.

Cool!  I created a patch for that, which I'm attaching.  I'm planning
to commit the patch.

Thanks for reviewing!

Best regards,
Etsuro Fujita

Attachment

Re: Asynchronous Append on postgres_fdw nodes.

From
Etsuro Fujita
Date:
On Mon, Jun 7, 2021 at 6:36 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I created a patch for that, which I'm attaching.  I'm planning
> to commit the patch.

Done.

Best regards,
Etsuro Fujita



On 11/5/21 06:55, Zhihong Yu wrote:
> On Mon, May 10, 2021 at 8:45 PM Andrey Lepikhov 
> <a.lepikhov@postgrespro.ru <mailto:a.lepikhov@postgrespro.ru>> wrote:
> It seems the if statement is not needed: you can directly assign false 
> to  subplan->async_capable.
I have completely rewritten this patch.

Main idea:

The async_capable field of a plan node inform us that this node could 
work in async mode. Each node sets this field based on its own logic.
The actual mode of a node is defined by the async_capable of PlanState 
structure. It is made at the executor initialization stage.
In this patch, only an append node could define async behaviour for its 
subplans.
With such approach the IsForeignPathAsyncCapable routine become 
unecessary, I think.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment
On Wed, Jun 30, 2021 at 1:50 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> I have completely rewritten this patch.
>
> Main idea:
>
> The async_capable field of a plan node inform us that this node could
> work in async mode. Each node sets this field based on its own logic.
> The actual mode of a node is defined by the async_capable of PlanState
> structure. It is made at the executor initialization stage.
> In this patch, only an append node could define async behaviour for its
> subplans.

I finally reviewed the patch.  One thing I noticed about the patch is
that it would break ordered Appends.  Here is such an example using
the patch:

create table pt (a int) partition by range (a);
create table loct1 (a int);
create table loct2 (a int);
create foreign table p1 partition of pt for values from (10) to (20)
server loopback1 options (table_name 'loct1');
create foreign table p2 partition of pt for values from (20) to (30)
server loopback2 options (table_name 'loct2');

explain verbose select * from pt order by a;
                       QUERY PLAN
-------------------------------------------------------------------------------------
 Append  (cost=200.00..440.45 rows=5850 width=4)
   ->  Async Foreign Scan on public.p1 pt_1  (cost=100.00..205.60
rows=2925 width=4)
         Output: pt_1.a
         Remote SQL: SELECT a FROM public.loct1 ORDER BY a ASC NULLS LAST
   ->  Async Foreign Scan on public.p2 pt_2  (cost=100.00..205.60
rows=2925 width=4)
         Output: pt_2.a
         Remote SQL: SELECT a FROM public.loct2 ORDER BY a ASC NULLS LAST
(7 rows)

This would not always provide tuples in the required order, as async
execution would provide them from the subplans rather randomly.  I
think it would not only be too late but be not efficient to do the
planning work at execution time (consider executing generic plans!),
so I think we should avoid doing so.  (The cost of doing that work for
simple foreign scans is small, but if we support async execution for
upper plan nodes such as NestLoop as discussed before, I think the
cost for such plan nodes would not be small anymore.)

To just execute what was planned at execution time, I think we should
return to the patch in [1].  The patch was created for Horiguchi-san’s
async-execution patch, so I modified it to work with HEAD, and added a
simplified version of your test cases.  Please find attached a patch.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/7fe10f95-ac6c-c81d-a9d3-227493eb9055@postgrespro.ru

Attachment

Re: Defer selection of asynchronous subplans until the executor initialization stage

From
"Andrey V. Lepikhov"
Date:
On 8/23/21 2:18 PM, Etsuro Fujita wrote:
> To just execute what was planned at execution time, I think we should
> return to the patch in [1].  The patch was created for Horiguchi-san’s
> async-execution patch, so I modified it to work with HEAD, and added a
> simplified version of your test cases.  Please find attached a patch.
> [1] https://www.postgresql.org/message-id/7fe10f95-ac6c-c81d-a9d3-227493eb9055@postgrespro.ru
I agree, this way is more safe. I tried to search for another approach, 
because here isn't general solution: for each plan node we should 
implement support of asynchronous behaviour.
But for practical use, for small set of nodes, it will work good. I 
haven't any objections for this patch.

-- 
regards,
Andrey Lepikhov
Postgres Professional



On Mon, Aug 30, 2021 at 5:36 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> On 8/23/21 2:18 PM, Etsuro Fujita wrote:
> > To just execute what was planned at execution time, I think we should
> > return to the patch in [1].  The patch was created for Horiguchi-san’s
> > async-execution patch, so I modified it to work with HEAD, and added a
> > simplified version of your test cases.  Please find attached a patch.

> > [1] https://www.postgresql.org/message-id/7fe10f95-ac6c-c81d-a9d3-227493eb9055@postgrespro.ru

> I agree, this way is more safe. I tried to search for another approach,
> because here isn't general solution: for each plan node we should
> implement support of asynchronous behaviour.

I think so too.

> But for practical use, for small set of nodes, it will work good. I
> haven't any objections for this patch.

OK

To allow async execution in a bit more cases, I modified the patch a
bit further: a ProjectionPath put directly above an async-capable
ForeignPath would also be considered async-capable as ForeignScan can
project and no separate Result is needed in that case, so I modified
mark_async_capable_plan() as such, and added test cases to the
postgres_fdw regression test.  Attached is an updated version of the
patch.

Thanks for the review!

Best regards,
Etsuro Fujita

Attachment

Re: Defer selection of asynchronous subplans until the executor initialization stage

From
Alexander Pyhalov
Date:
Etsuro Fujita писал 2021-08-30 12:52:
> On Mon, Aug 30, 2021 at 5:36 PM Andrey V. Lepikhov
> 
> To allow async execution in a bit more cases, I modified the patch a
> bit further: a ProjectionPath put directly above an async-capable
> ForeignPath would also be considered async-capable as ForeignScan can
> project and no separate Result is needed in that case, so I modified
> mark_async_capable_plan() as such, and added test cases to the
> postgres_fdw regression test.  Attached is an updated version of the
> patch.
> 

Hi.

The patch looks good to me and seems to work as expected.
-- 
Best regards,
Alexander Pyhalov,
Postgres Professional



Hi Alexander,

On Wed, Sep 15, 2021 at 3:40 PM Alexander Pyhalov
<a.pyhalov@postgrespro.ru> wrote:
> Etsuro Fujita писал 2021-08-30 12:52:
> > To allow async execution in a bit more cases, I modified the patch a
> > bit further: a ProjectionPath put directly above an async-capable
> > ForeignPath would also be considered async-capable as ForeignScan can
> > project and no separate Result is needed in that case, so I modified
> > mark_async_capable_plan() as such, and added test cases to the
> > postgres_fdw regression test.  Attached is an updated version of the
> > patch.

> The patch looks good to me and seems to work as expected.

Thanks for reviewing!  I’m planning to commit the patch.

Sorry for the long delay.

Best regards,
Etsuro Fujita



On Sun, Mar 13, 2022 at 6:39 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Wed, Sep 15, 2021 at 3:40 PM Alexander Pyhalov
> <a.pyhalov@postgrespro.ru> wrote:
> > The patch looks good to me and seems to work as expected.
>
> I’m planning to commit the patch.

I polished the patch a bit:

* Reordered a bit of code in create_append_plan() in logical order (no
functional changes).
* Added more comments.
* Added/Tweaked regression test cases.

Also, I added the commit message.  Attached is a new version of the
patch.  Barring objections, I’ll commit this.

Best regards,
Etsuro Fujita

Attachment


On Sun, Apr 3, 2022 at 3:28 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Sun, Mar 13, 2022 at 6:39 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Wed, Sep 15, 2021 at 3:40 PM Alexander Pyhalov
> <a.pyhalov@postgrespro.ru> wrote:
> > The patch looks good to me and seems to work as expected.
>
> I’m planning to commit the patch.

I polished the patch a bit:

* Reordered a bit of code in create_append_plan() in logical order (no
functional changes).
* Added more comments.
* Added/Tweaked regression test cases.

Also, I added the commit message.  Attached is a new version of the
patch.  Barring objections, I’ll commit this.

Best regards,
Etsuro Fujita
Hi,

+   WRITE_ENUM_FIELD(status, SubqueryScanStatus);

Looks like the new field can be named subquerystatus - this way its purpose is clearer.

+ * mark_async_capable_plan
+ *     Check whether a given Path node is async-capable, and if so, mark the
+ *     Plan node created from it as such.

Please add comment explaining what the return value means.

+           if (!IsA(plan, Result) &&
+               mark_async_capable_plan(plan,
+                                       ((ProjectionPath *) path)->subpath))
+               return true;

by returning true, `plan->async_capable = true;` is skipped.
Is that intentional ?

Cheers 
Hi Zhihong,

On Sun, Apr 3, 2022 at 11:38 PM Zhihong Yu <zyu@yugabyte.com> wrote:
> +   WRITE_ENUM_FIELD(status, SubqueryScanStatus);
>
> Looks like the new field can be named subquerystatus - this way its purpose is clearer.

I agree that “status” is too general.  “subquerystatus” might be good,
but I’d like to propose “scanstatus” instead, because I think this
would be consistent with the naming of the RowMarkType-enum member
“markType” in PlanRowMark defined in the same file.

> + * mark_async_capable_plan
> + *     Check whether a given Path node is async-capable, and if so, mark the
> + *     Plan node created from it as such.
>
> Please add comment explaining what the return value means.

Ok, how about something like this?

“Check whether a given Path node is async-capable, and if so, mark the
Plan node created from it as such and return true; otherwise, return
false.”

> +           if (!IsA(plan, Result) &&
> +               mark_async_capable_plan(plan,
> +                                       ((ProjectionPath *) path)->subpath))
> +               return true;
>
> by returning true, `plan->async_capable = true;` is skipped.
> Is that intentional ?

That is intentional; we don’t need to set the async_capable flag
because in that case the flag would already have been set by the above
mark_async_capable_plan().  Note that we pass “plan” to that function.

Thanks for reviewing!

Best regards,
Etsuro Fujita



Re: Defer selection of asynchronous subplans until the executor initialization stage

From
"Andrey V. Lepikhov"
Date:
On 4/3/22 15:29, Etsuro Fujita wrote:
> On Sun, Mar 13, 2022 at 6:39 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>> On Wed, Sep 15, 2021 at 3:40 PM Alexander Pyhalov
>> <a.pyhalov@postgrespro.ru> wrote:
>>> The patch looks good to me and seems to work as expected.
>>
>> I’m planning to commit the patch.
> 
> I polished the patch a bit:
> 
> * Reordered a bit of code in create_append_plan() in logical order (no
> functional changes).
> * Added more comments.
> * Added/Tweaked regression test cases.
> 
> Also, I added the commit message.  Attached is a new version of the
> patch.  Barring objections, I’ll commit this.

Sorry for late answer - just vacation.
I looked through this patch - looks much more stable.
But, as far as I remember, on previous version some problems were found 
out on the TPC-H test. I want to play a bit with the TPC-H and with 
parameterized plans.

-- 
regards,
Andrey Lepikhov
Postgres Professional



On Mon, Apr 4, 2022 at 1:06 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Sun, Apr 3, 2022 at 11:38 PM Zhihong Yu <zyu@yugabyte.com> wrote:
> > +   WRITE_ENUM_FIELD(status, SubqueryScanStatus);
> >
> > Looks like the new field can be named subquerystatus - this way its purpose is clearer.
>
> I agree that “status” is too general.  “subquerystatus” might be good,
> but I’d like to propose “scanstatus” instead, because I think this
> would be consistent with the naming of the RowMarkType-enum member
> “markType” in PlanRowMark defined in the same file.
>
> > + * mark_async_capable_plan
> > + *     Check whether a given Path node is async-capable, and if so, mark the
> > + *     Plan node created from it as such.
> >
> > Please add comment explaining what the return value means.
>
> Ok, how about something like this?
>
> “Check whether a given Path node is async-capable, and if so, mark the
> Plan node created from it as such and return true; otherwise, return
> false.”

I have committed the patch after modifying it as such.  (I think we
can improve these later, if necessary.)

Best regards,
Etsuro Fujita



On Mon, Apr 4, 2022 at 6:30 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> On 4/3/22 15:29, Etsuro Fujita wrote:
> > Also, I added the commit message.  Attached is a new version of the
> > patch.  Barring objections, I’ll commit this.

> I looked through this patch - looks much more stable.
> But, as far as I remember, on previous version some problems were found
> out on the TPC-H test. I want to play a bit with the TPC-H and with
> parameterized plans.

I might be missing something, but I don't see any problems, so I have
committed the patch after some modifications.  If you find them,
please let me know.

Thanks!

Best regards,
Etsuro Fujita



On Wed, Apr 06, 2022 at 03:58:29PM +0900, Etsuro Fujita wrote:
> I have committed the patch after modifying it as such.  (I think we
> can improve these later, if necessary.)

This patch seems to be causing the planner to crash.
Here's a query reduced from sqlsmith.

| explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at ../../../../src/include/nodes/pg_list.h:151
151             return l ? l->length : 0;
(gdb) bt
#0  0x000055b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at ../../../../src/include/nodes/pg_list.h:151
#1  0x000055b43968af89 in mark_async_capable_plan (plan=plan@entry=0x7f4219ed93b0, path=path@entry=0x7f4219e89538) at
createplan.c:1132
#2  0x000055b439691924 in create_append_plan (root=root@entry=0x55b43affb2b0, best_path=best_path@entry=0x7f4219ed0cb8,
flags=flags@entry=0)at createplan.c:1329
 
#3  0x000055b43968fa21 in create_plan_recurse (root=root@entry=0x55b43affb2b0,
best_path=best_path@entry=0x7f4219ed0cb8,flags=flags@entry=0) at createplan.c:421
 
#4  0x000055b43968f974 in create_projection_plan (root=root@entry=0x55b43affb2b0,
best_path=best_path@entry=0x7f4219ed0f60,flags=flags@entry=1) at createplan.c:2039
 
#5  0x000055b43968fa6f in create_plan_recurse (root=root@entry=0x55b43affb2b0, best_path=0x7f4219ed0f60,
flags=flags@entry=1)at createplan.c:433
 
#6  0x000055b439690221 in create_plan (root=root@entry=0x55b43affb2b0, best_path=<optimized out>) at createplan.c:348
#7  0x000055b4396a1451 in standard_planner (parse=0x55b43af05e28, query_string=<optimized out>, cursorOptions=2048,
boundParams=0x0)at planner.c:413
 
#8  0x000055b4396a19c1 in planner (parse=parse@entry=0x55b43af05e28, query_string=query_string@entry=0x55b43af04c40
"SELECT1 FROM information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();", 
 
    cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) at planner.c:277
#9  0x000055b439790c78 in pg_plan_query (querytree=querytree@entry=0x55b43af05e28,
query_string=query_string@entry=0x55b43af04c40"SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 >
pg_trigger_depth();",
 
    cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) at postgres.c:883
#10 0x000055b439790d54 in pg_plan_queries (querytrees=0x55b43afdd528, query_string=query_string@entry=0x55b43af04c40
"SELECT1 FROM information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();", 
 
    cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) at postgres.c:975
#11 0x000055b439791239 in exec_simple_query (query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM
information_schema.constraint_column_usageWHERE 1 > pg_trigger_depth();") at postgres.c:1169
 
#12 0x000055b439793183 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4542
#13 0x000055b4396e6af7 in BackendRun (port=port@entry=0x55b43af2ffe0) at postmaster.c:4489
#14 0x000055b4396e9c03 in BackendStartup (port=port@entry=0x55b43af2ffe0) at postmaster.c:4217
#15 0x000055b4396e9e4a in ServerLoop () at postmaster.c:1791
#16 0x000055b4396eb401 in PostmasterMain (argc=7, argv=<optimized out>) at postmaster.c:1463
#17 0x000055b43962b4df in main (argc=7, argv=0x55b43aeff0c0) at main.c:202

Actually, the original query failed like this:
#2  0x000055b4398e9f90 in ExceptionalCondition (conditionName=conditionName@entry=0x55b439a61238 "plan->scanstatus ==
SUBQUERY_SCAN_UNKNOWN",errorType=errorType@entry=0x55b43994b00b "FailedAssertion", 
 
#3  0x000055b4396a2ecf in trivial_subqueryscan (plan=0x55b43b59cac8) at setrefs.c:1367





On Fri, Apr 8, 2022 at 5:43 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Wed, Apr 06, 2022 at 03:58:29PM +0900, Etsuro Fujita wrote:
> I have committed the patch after modifying it as such.  (I think we
> can improve these later, if necessary.)

This patch seems to be causing the planner to crash.
Here's a query reduced from sqlsmith.

| explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at ../../../../src/include/nodes/pg_list.h:151
151             return l ? l->length : 0;
(gdb) bt
#0  0x000055b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at ../../../../src/include/nodes/pg_list.h:151
#1  0x000055b43968af89 in mark_async_capable_plan (plan=plan@entry=0x7f4219ed93b0, path=path@entry=0x7f4219e89538) at createplan.c:1132
#2  0x000055b439691924 in create_append_plan (root=root@entry=0x55b43affb2b0, best_path=best_path@entry=0x7f4219ed0cb8, flags=flags@entry=0) at createplan.c:1329
#3  0x000055b43968fa21 in create_plan_recurse (root=root@entry=0x55b43affb2b0, best_path=best_path@entry=0x7f4219ed0cb8, flags=flags@entry=0) at createplan.c:421
#4  0x000055b43968f974 in create_projection_plan (root=root@entry=0x55b43affb2b0, best_path=best_path@entry=0x7f4219ed0f60, flags=flags@entry=1) at createplan.c:2039
#5  0x000055b43968fa6f in create_plan_recurse (root=root@entry=0x55b43affb2b0, best_path=0x7f4219ed0f60, flags=flags@entry=1) at createplan.c:433
#6  0x000055b439690221 in create_plan (root=root@entry=0x55b43affb2b0, best_path=<optimized out>) at createplan.c:348
#7  0x000055b4396a1451 in standard_planner (parse=0x55b43af05e28, query_string=<optimized out>, cursorOptions=2048, boundParams=0x0) at planner.c:413
#8  0x000055b4396a19c1 in planner (parse=parse@entry=0x55b43af05e28, query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
    cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) at planner.c:277
#9  0x000055b439790c78 in pg_plan_query (querytree=querytree@entry=0x55b43af05e28, query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
    cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) at postgres.c:883
#10 0x000055b439790d54 in pg_plan_queries (querytrees=0x55b43afdd528, query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
    cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) at postgres.c:975
#11 0x000055b439791239 in exec_simple_query (query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();") at postgres.c:1169
#12 0x000055b439793183 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4542
#13 0x000055b4396e6af7 in BackendRun (port=port@entry=0x55b43af2ffe0) at postmaster.c:4489
#14 0x000055b4396e9c03 in BackendStartup (port=port@entry=0x55b43af2ffe0) at postmaster.c:4217
#15 0x000055b4396e9e4a in ServerLoop () at postmaster.c:1791
#16 0x000055b4396eb401 in PostmasterMain (argc=7, argv=<optimized out>) at postmaster.c:1463
#17 0x000055b43962b4df in main (argc=7, argv=0x55b43aeff0c0) at main.c:202

Actually, the original query failed like this:
#2  0x000055b4398e9f90 in ExceptionalCondition (conditionName=conditionName@entry=0x55b439a61238 "plan->scanstatus == SUBQUERY_SCAN_UNKNOWN", errorType=errorType@entry=0x55b43994b00b "FailedAssertion",
#3  0x000055b4396a2ecf in trivial_subqueryscan (plan=0x55b43b59cac8) at setrefs.c:1367

Hi,
I logged the value of plan->scanstatus before the assertion :

2022-04-08 16:20:59.601 UTC [26325] LOG:  scan status 0
2022-04-08 16:20:59.601 UTC [26325] STATEMENT:  explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
2022-04-08 16:20:59.796 UTC [26296] LOG:  server process (PID 26325) was terminated by signal 11: Segmentation fault 

It seems its value was SUBQUERY_SCAN_UNKNOWN.

Still trying to find out the cause for the crash.
Hi,

On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> This patch seems to be causing the planner to crash.
> Here's a query reduced from sqlsmith.
>
> | explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
>
> Program terminated with signal SIGSEGV, Segmentation fault.

Reproduced.  Will look into this.

Thanks for the report!

Best regards,
Etsuro Fujita



On Sat, Apr 9, 2022 at 1:58 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > This patch seems to be causing the planner to crash.
> > Here's a query reduced from sqlsmith.
> >
> > | explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
> >
> > Program terminated with signal SIGSEGV, Segmentation fault.
>
> Reproduced.  Will look into this.

I think the cause of this is that mark_async_capable_plan() failed to
take into account that when the given path is a SubqueryScanPath or
ForeignPath, the given corresponding plan might include a gating
Result node that evaluates pseudoconstant quals.  My oversight.  :-(
Attached is a patch for fixing that.  I think v14 has the same issue,
so I think we need backpatching.

Best regards,
Etsuro Fujita

Attachment
Hi,

On Sat, Apr 9, 2022 at 1:24 AM Zhihong Yu <zyu@yugabyte.com> wrote:
> On Fri, Apr 8, 2022 at 5:43 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>> This patch seems to be causing the planner to crash.
>> Here's a query reduced from sqlsmith.
>>
>> | explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
>>
>> Program terminated with signal SIGSEGV, Segmentation fault.

> I logged the value of plan->scanstatus before the assertion :
>
> 2022-04-08 16:20:59.601 UTC [26325] LOG:  scan status 0
> 2022-04-08 16:20:59.601 UTC [26325] STATEMENT:  explain SELECT 1 FROM information_schema.constraint_column_usage
WHERE1 <= pg_trigger_depth();
 
> 2022-04-08 16:20:59.796 UTC [26296] LOG:  server process (PID 26325) was terminated by signal 11: Segmentation fault
>
> It seems its value was SUBQUERY_SCAN_UNKNOWN.
>
> Still trying to find out the cause for the crash.

I think the cause is an oversight in mark_async_capable_plan().  See [1].

Thanks!

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK15NkuaVo0Fu_0TfoCpPPJaJi4OMLzEQtkE6Bt6YT52fPQ%40mail.gmail.com





On Sun, Apr 10, 2022 at 3:42 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Sat, Apr 9, 2022 at 1:58 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > This patch seems to be causing the planner to crash.
> > Here's a query reduced from sqlsmith.
> >
> > | explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
> >
> > Program terminated with signal SIGSEGV, Segmentation fault.
>
> Reproduced.  Will look into this.

I think the cause of this is that mark_async_capable_plan() failed to
take into account that when the given path is a SubqueryScanPath or
ForeignPath, the given corresponding plan might include a gating
Result node that evaluates pseudoconstant quals.  My oversight.  :-(
Attached is a patch for fixing that.  I think v14 has the same issue,
so I think we need backpatching.

Best regards,
Etsuro Fujita
Hi,
Looking at the second hunk of the patch:
                FdwRoutine *fdwroutine = path->parent->fdwroutine; 
...
+               if (IsA(plan, Result))
+                   return false;

It seems the check of whether plan is a Result node can be lifted ahead of the switch statement (i.e. to the beginning of mark_async_capable_plan).

This way, we don't have to check for every case in the switch statement.

Cheers
On Sun, Apr 10, 2022 at 07:43:48PM +0900, Etsuro Fujita wrote:
> On Sat, Apr 9, 2022 at 1:58 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > This patch seems to be causing the planner to crash.
> > > Here's a query reduced from sqlsmith.
> > >
> > > | explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
> > >
> > > Program terminated with signal SIGSEGV, Segmentation fault.
> >
> > Reproduced.  Will look into this.
> 
> I think the cause of this is that mark_async_capable_plan() failed to
> take into account that when the given path is a SubqueryScanPath or
> ForeignPath, the given corresponding plan might include a gating
> Result node that evaluates pseudoconstant quals.  My oversight.  :-(
> Attached is a patch for fixing that.  I think v14 has the same issue,
> so I think we need backpatching.

Thanks - this seems to resolve the issue.

On Sun, Apr 10, 2022 at 06:46:25AM -0700, Zhihong Yu wrote:
> Looking at the second hunk of the patch:
>                 FdwRoutine *fdwroutine = path->parent->fdwroutine;
> ...
> +               if (IsA(plan, Result))
> +                   return false;
> 
> It seems the check of whether plan is a Result node can be lifted ahead of
> the switch statement (i.e. to the beginning of mark_async_capable_plan).
> 
> This way, we don't have to check for every case in the switch statement.

I think you misread it - the other branch says: if (*not* IsA())

-- 
Justin





On Sun, Apr 10, 2022 at 7:41 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Sun, Apr 10, 2022 at 07:43:48PM +0900, Etsuro Fujita wrote:
> On Sat, Apr 9, 2022 at 1:58 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > This patch seems to be causing the planner to crash.
> > > Here's a query reduced from sqlsmith.
> > >
> > > | explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
> > >
> > > Program terminated with signal SIGSEGV, Segmentation fault.
> >
> > Reproduced.  Will look into this.
>
> I think the cause of this is that mark_async_capable_plan() failed to
> take into account that when the given path is a SubqueryScanPath or
> ForeignPath, the given corresponding plan might include a gating
> Result node that evaluates pseudoconstant quals.  My oversight.  :-(
> Attached is a patch for fixing that.  I think v14 has the same issue,
> so I think we need backpatching.

Thanks - this seems to resolve the issue.

On Sun, Apr 10, 2022 at 06:46:25AM -0700, Zhihong Yu wrote:
> Looking at the second hunk of the patch:
>                 FdwRoutine *fdwroutine = path->parent->fdwroutine;
> ...
> +               if (IsA(plan, Result))
> +                   return false;
>
> It seems the check of whether plan is a Result node can be lifted ahead of
> the switch statement (i.e. to the beginning of mark_async_capable_plan).
>
> This way, we don't have to check for every case in the switch statement.

I think you misread it - the other branch says: if (*not* IsA())

No, I didn't misread:

            if (!IsA(plan, Result) &&
                mark_async_capable_plan(plan,
                                        ((ProjectionPath *) path)->subpath))
                return true;
            return false;

If the plan is Result node, false would be returned.
So the check can be lifted to the beginning of the func.

Cheers
On Mon, Apr 11, 2022 at 11:44 AM Zhihong Yu <zyu@yugabyte.com> wrote:
> On Sun, Apr 10, 2022 at 7:41 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>> On Sun, Apr 10, 2022 at 06:46:25AM -0700, Zhihong Yu wrote:
>> > Looking at the second hunk of the patch:
>> >                 FdwRoutine *fdwroutine = path->parent->fdwroutine;
>> > ...
>> > +               if (IsA(plan, Result))
>> > +                   return false;
>> >
>> > It seems the check of whether plan is a Result node can be lifted ahead of
>> > the switch statement (i.e. to the beginning of mark_async_capable_plan).
>> >
>> > This way, we don't have to check for every case in the switch statement.
>>
>> I think you misread it - the other branch says: if (*not* IsA())
>>
> No, I didn't misread:
>
>             if (!IsA(plan, Result) &&
>                 mark_async_capable_plan(plan,
>                                         ((ProjectionPath *) path)->subpath))
>                 return true;
>             return false;
>
> If the plan is Result node, false would be returned.
> So the check can be lifted to the beginning of the func.

I think we might support more cases in the switch statement in the
future.  My concern about your proposal is that it might make it hard
to add new cases to the statement.  I agree that what I proposed has a
bit of redundant code, but writing code inside each case independently
would make it easy to add them, making code consistent across branches
and thus making back-patching easy.

Thanks for reviewing!

Best regards,
Etsuro Fujita





On Sun, Apr 17, 2022 at 1:48 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Mon, Apr 11, 2022 at 11:44 AM Zhihong Yu <zyu@yugabyte.com> wrote:
> On Sun, Apr 10, 2022 at 7:41 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>> On Sun, Apr 10, 2022 at 06:46:25AM -0700, Zhihong Yu wrote:
>> > Looking at the second hunk of the patch:
>> >                 FdwRoutine *fdwroutine = path->parent->fdwroutine;
>> > ...
>> > +               if (IsA(plan, Result))
>> > +                   return false;
>> >
>> > It seems the check of whether plan is a Result node can be lifted ahead of
>> > the switch statement (i.e. to the beginning of mark_async_capable_plan).
>> >
>> > This way, we don't have to check for every case in the switch statement.
>>
>> I think you misread it - the other branch says: if (*not* IsA())
>>
> No, I didn't misread:
>
>             if (!IsA(plan, Result) &&
>                 mark_async_capable_plan(plan,
>                                         ((ProjectionPath *) path)->subpath))
>                 return true;
>             return false;
>
> If the plan is Result node, false would be returned.
> So the check can be lifted to the beginning of the func.

I think we might support more cases in the switch statement in the
future.  My concern about your proposal is that it might make it hard
to add new cases to the statement.  I agree that what I proposed has a
bit of redundant code, but writing code inside each case independently
would make it easy to add them, making code consistent across branches
and thus making back-patching easy.

Thanks for reviewing!

Best regards,
Etsuro Fujita
Hi,
When a new case arises where the plan is not a Result node, this func can be rewritten.
If there is only one such new case, the check at the beginning of the func can be tuned to exclude that case.

I still think the check should be lifted to the beginning of the func (given the current cases).

Cheers
Hi,

On Sun, Apr 17, 2022 at 7:30 PM Zhihong Yu <zyu@yugabyte.com> wrote:
> On Sun, Apr 17, 2022 at 1:48 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>> I think we might support more cases in the switch statement in the
>> future.  My concern about your proposal is that it might make it hard
>> to add new cases to the statement.  I agree that what I proposed has a
>> bit of redundant code, but writing code inside each case independently
>> would make it easy to add them, making code consistent across branches
>> and thus making back-patching easy.

> When a new case arises where the plan is not a Result node, this func can be rewritten.
> If there is only one such new case, the check at the beginning of the func can be tuned to exclude that case.

Sorry, I don't agree with you.

> I still think the check should be lifted to the beginning of the func (given the current cases).

The given path isn't limited to SubqueryScanPath, ForeignPath and
ProjectionPath, so another concern is extra cycles needed when the
path is other path type that is projection-capable (e.g., Path for
sequential scan, IndexPath, NestPath, ...).  Assume that the given
path is a Path (that doesn't contain pseudoconstant quals).  In that
case the given SeqScan plan node wouldn't contain a gating Result
node, so if we put the if test at the top of the function, we need to
execute not only the test but the switch statement for the given
path/plan nodes.  But if we put the if test inside each case block, we
only need to execute the switch statement, without executing the test.
In the latter case I think we can save cycles for normal cases.

In short: I don't think it's a great idea to put the if test at the
top of the function.

Best regards,
Etsuro Fujita





On Tue, Apr 19, 2022 at 2:01 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
Hi,

On Sun, Apr 17, 2022 at 7:30 PM Zhihong Yu <zyu@yugabyte.com> wrote:
> On Sun, Apr 17, 2022 at 1:48 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>> I think we might support more cases in the switch statement in the
>> future.  My concern about your proposal is that it might make it hard
>> to add new cases to the statement.  I agree that what I proposed has a
>> bit of redundant code, but writing code inside each case independently
>> would make it easy to add them, making code consistent across branches
>> and thus making back-patching easy.

> When a new case arises where the plan is not a Result node, this func can be rewritten.
> If there is only one such new case, the check at the beginning of the func can be tuned to exclude that case.

Sorry, I don't agree with you.

> I still think the check should be lifted to the beginning of the func (given the current cases).

The given path isn't limited to SubqueryScanPath, ForeignPath and
ProjectionPath, so another concern is extra cycles needed when the
path is other path type that is projection-capable (e.g., Path for
sequential scan, IndexPath, NestPath, ...).  Assume that the given
path is a Path (that doesn't contain pseudoconstant quals).  In that
case the given SeqScan plan node wouldn't contain a gating Result
node, so if we put the if test at the top of the function, we need to
execute not only the test but the switch statement for the given
path/plan nodes.  But if we put the if test inside each case block, we
only need to execute the switch statement, without executing the test.
In the latter case I think we can save cycles for normal cases.

In short: I don't think it's a great idea to put the if test at the
top of the function.

Best regards,
Etsuro Fujita
Hi,
It is okay to keep the formation in your patch.

Cheers 
Hi,

On Wed, Apr 20, 2022 at 2:04 AM Zhihong Yu <zyu@yugabyte.com> wrote:
> It is okay to keep the formation in your patch.

I modified mark_async_capable_plan() a bit further; 1) adjusted code
in the ProjectionPath case, just for consistency with other cases, and
2) tweaked/improved comments a bit.  Attached is a new version of the
patch (“prevent-async-2.patch”).

As mentioned before, v14 has the same issue, so I created a fix for
v14, which I’m attaching as well (“prevent-async-2-v14.patch”).  In
the fix I modified is_async_capable_path() the same way as
mark_async_capable_plan() in HEAD, renaming it to
is_async_capable_plan(), and updated some comments.

Barring objections, I’ll push/back-patch these.

Thanks!

Best regards,
Etsuro Fujita

Attachment
On Mon, Apr 25, 2022 at 1:29 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I modified mark_async_capable_plan() a bit further; 1) adjusted code
> in the ProjectionPath case, just for consistency with other cases, and
> 2) tweaked/improved comments a bit.  Attached is a new version of the
> patch (“prevent-async-2.patch”).
>
> As mentioned before, v14 has the same issue, so I created a fix for
> v14, which I’m attaching as well (“prevent-async-2-v14.patch”).  In
> the fix I modified is_async_capable_path() the same way as
> mark_async_capable_plan() in HEAD, renaming it to
> is_async_capable_plan(), and updated some comments.
>
> Barring objections, I’ll push/back-patch these.

Done.

Best regards,
Etsuro Fujita



On Wed, Apr 6, 2022 at 3:58 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I have committed the patch after modifying it as such.

The patch calls trivial_subqueryscan() during create_append_plan() to
determine the triviality of a SubqueryScan that is a child of an
Append node.  Unlike when calling it from
set_subqueryscan_references(), this is done before some
post-processing such as set_plan_references() on the subquery.  The
reason why this is safe wouldn't be that obvious, so I added to
trivial_subqueryscan() comments explaining this.  Attached is a patch
for that.

Best regards,
Etsuro Fujita

Attachment


On Thu, Jun 2, 2022 at 5:08 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Wed, Apr 6, 2022 at 3:58 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> I have committed the patch after modifying it as such.

The patch calls trivial_subqueryscan() during create_append_plan() to
determine the triviality of a SubqueryScan that is a child of an
Append node.  Unlike when calling it from
set_subqueryscan_references(), this is done before some
post-processing such as set_plan_references() on the subquery.  The
reason why this is safe wouldn't be that obvious, so I added to
trivial_subqueryscan() comments explaining this.  Attached is a patch
for that.

Best regards,
Etsuro Fujita
Hi,
Suggestion on formatting the comment:

+ * node (or that for any plan node in the subplan tree), 2)
+ * set_plan_references() modifies the tlist for every plan node in the

It would be more readable if `2)` is put at the beginning of the second line above.

+ * preserves the length and order of the tlist, and 3) set_plan_references()
+ * might delete the topmost plan node like an Append or MergeAppend from the

Similarly you can move `3) set_plan_references()` to the beginning of the next line.

Cheers
On Fri, Jun 3, 2022 at 1:03 AM Zhihong Yu <zyu@yugabyte.com> wrote:
> Suggestion on formatting the comment:
>
> + * node (or that for any plan node in the subplan tree), 2)
> + * set_plan_references() modifies the tlist for every plan node in the
>
> It would be more readable if `2)` is put at the beginning of the second line above.
>
> + * preserves the length and order of the tlist, and 3) set_plan_references()
> + * might delete the topmost plan node like an Append or MergeAppend from the
>
> Similarly you can move `3) set_plan_references()` to the beginning of the next line.

Seems like a good idea, so I updated the patch as you suggest.  I did
some indentation as well, which I think improves readability a bit
further.  Attached is an updated version.  If no objections, I’ll
commit the patch.

Thanks for reviewing!

Best regards,
Etsuro Fujita

Attachment
On Wed, Jun 8, 2022 at 7:18 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Fri, Jun 3, 2022 at 1:03 AM Zhihong Yu <zyu@yugabyte.com> wrote:
> > Suggestion on formatting the comment:
> >
> > + * node (or that for any plan node in the subplan tree), 2)
> > + * set_plan_references() modifies the tlist for every plan node in the
> >
> > It would be more readable if `2)` is put at the beginning of the second line above.
> >
> > + * preserves the length and order of the tlist, and 3) set_plan_references()
> > + * might delete the topmost plan node like an Append or MergeAppend from the
> >
> > Similarly you can move `3) set_plan_references()` to the beginning of the next line.
>
> Seems like a good idea, so I updated the patch as you suggest.  I did
> some indentation as well, which I think improves readability a bit
> further.  Attached is an updated version.  If no objections, I’ll
> commit the patch.

Done.

Best regards,
Etsuro Fujita