Re: Asynchronous Append on postgres_fdw nodes. - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Asynchronous Append on postgres_fdw nodes.
Date
Msg-id CA+hUKG+R75EnQqMXdxRDRxehC7VYRmqsg03ZrvUyomV3=S11KQ@mail.gmail.com
Whole thread Raw
In response to Asynchronous Append on postgres_fdw nodes.  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: Asynchronous Append on postgres_fdw nodes.
List pgsql-hackers
On Fri, Feb 28, 2020 at 9:08 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> - v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
>   The async feature uses WaitEvent, and it needs to be released on
>   error.  This patch makes it possible to register WaitEvent to
>   resowner to handle that case..

+1

> - v2-0002-infrastructure-for-asynchronous-execution.patch
>   It povides an abstraction layer of asynchronous behavior
>   (execAsync). Then adds ExecAppend, another version of ExecAppend,
>   that handles "async-capable" subnodes asynchronously. Also it
>   contains planner part that makes planner aware of "async-capable"
>   and "async-aware" path nodes.

> This patch add an infrastructure for asynchronous execution. As a PoC
> this makes only Append capable to handle asynchronously executable
> subnodes.

What other nodes do you think could be async aware?  I suppose you
would teach joins to pass through the async support of their children,
and then you could make partition-wise join work like that.

+    /* choose appropriate version of Exec function */
+    if (appendstate->as_nasyncplans == 0)
+        appendstate->ps.ExecProcNode = ExecAppend;
+    else
+        appendstate->ps.ExecProcNode = ExecAppendAsync;

Cool.  No extra cost for people not using the new feature.

+        slot = ExecProcNode(subnode);
+        if (subnode->asyncstate == AS_AVAILABLE)

So, now when you execute a node, you get a result AND you get some
information that you access by reaching into the child node's
PlanState.  The ExecProcNode() interface is extremely limiting, but
I'm not sure if this is the right way to extend it.  Maybe
ExecAsyncProcNode() with a wide enough interface to do the job?

+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+    static int *refind = NULL;
+    static int refindsize = 0;
...
+    if (refindsize < n)
...
+            static ExecAsync_mcbarg mcb_arg =
+                { &refind, &refindsize };
+            static MemoryContextCallback mcb =
+                { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
...
+            MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);

This seems a bit strange.  Why not just put the pointer in the plan
state?  I suppose you want to avoid allocating a new buffer for every
query.  Perhaps you could fix that by having a small fixed-size buffer
in the PlanState to cover common cases and allocating a larger one in
a per-query memory context if that one is too small, or just not
worrying about it and allocating every time since you know the desired
size.

+    wes = CreateWaitEventSet(TopTransactionContext,
TopTransactionResourceOwner, n);
...
+    FreeWaitEventSet(wes);

BTW, just as an FYI, I am proposing[1] to add support for
RemoveWaitEvent(), so that you could have a single WaitEventSet for
the lifetime of the executor node, and then add and remove sockets
only as needed.  I'm hoping to commit that for PG13, if there are no
objections or better ideas soon, because it's useful for some other
places where we currently create and destroy WaitEventSets frequently.
One complication when working with long-lived WaitEventSet objects is
that libpq (or some other thing used by some other hypothetical
async-capable FDW) is free to close and reopen its socket whenever it
wants, so you need a way to know when it has done that.  In that patch
set I added pqSocketChangeCount() so that you can see when pgSocket()
refers to a new socket (even if the file descriptor number is the same
by coincidence), but that imposes some book-keeping duties on the
caller, and now I'm wondering how that would look in your patch set.
My goal is to generate the minimum number of systems calls.  I think
it would be nice if a 1000-shard query only calls epoll_ctl() when a
child node needs to be added or removed from the set, not
epoll_create(), 1000 * epoll_ctl(), epoll_wait(), close()  for every
wait.  But I suppose there is an argument that it's more complication
than it's worth.

[1] https://commitfest.postgresql.org/27/2452/



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: Crash by targetted recovery
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: Identifying user-created objects