Re: Asynchronous Append on postgres_fdw nodes. - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Asynchronous Append on postgres_fdw nodes.
Date
Msg-id 20200305.174424.937266824649246262.horikyota.ntt@gmail.com
In response to Re: Asynchronous Append on postgres_fdw nodes.  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Asynchronous Append on postgres_fdw nodes.
List pgsql-hackers
Thank you for the comment.

At Thu, 5 Mar 2020 16:17:24 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in 
> On Fri, Feb 28, 2020 at 9:08 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > - v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
> >   The async feature uses WaitEvent, and it needs to be released on
> >   error.  This patch makes it possible to register a WaitEvent with
> >   a resowner to handle that case.
> 
> +1
> 
> > - v2-0002-infrastructure-for-asynchronous-execution.patch
> >   It provides an abstraction layer for asynchronous behavior
> >   (execAsync). It then adds ExecAppendAsync, another version of
> >   ExecAppend, that handles "async-capable" subnodes asynchronously.
> >   It also contains the planner part that makes the planner aware of
> >   "async-capable" and "async-aware" path nodes.
> 
> > This patch adds an infrastructure for asynchronous execution. As a
> > PoC, it makes only Append capable of handling asynchronously
> > executable subnodes.
> 
> What other nodes do you think could be async aware?  I suppose you
> would teach joins to pass through the async support of their children,
> and then you could make partition-wise join work like that.

An Append node is fed by many immediate-child async-capable nodes, so
the Append node can pick any child node that has fired.
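
To illustrate the shape, something like the sketch below.  Here
as_nasyncplans, appendplans, asyncstate, AS_AVAILABLE and
ExecAsyncEventWait follow the patch as quoted below; AS_WAITING and
the exact loop are my assumptions for illustration only.

static TupleTableSlot *
ExecAppendAsyncSketch(AppendState *node)
{
    Bitmapset  *waiting = NULL;

    for (;;)
    {
        int         i;

        /* Poll each async-capable child once. */
        for (i = 0; i < node->as_nasyncplans; i++)
        {
            PlanState      *subnode = node->appendplans[i];
            TupleTableSlot *slot = ExecProcNode(subnode);

            if (subnode->asyncstate == AS_AVAILABLE && !TupIsNull(slot))
                return slot;    /* this child has fired */
            else if (subnode->asyncstate == AS_WAITING)
                waiting = bms_add_member(waiting, i);
        }

        if (bms_is_empty(waiting))
            return NULL;        /* every child is exhausted */

        /* Block until at least one waiting child's socket fires,
         * then poll again (the patch's ExecAsyncEventWait). */
        waiting = ExecAsyncEventWait(node->appendplans, waiting, -1);
    }
}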

Unfortunately, joins are not wide but deep.  That means a set of
async-capable nodes has multiple async-aware (and, for intermediate
nodes, async-capable at the same time) parent nodes.  So if we want to
be async in that configuration, we need a "push-up" executor engine.
In my last trial, ignoring performance, I could turn almost all nodes
into push-up style, but a few nodes like WindowAgg or HashJoin have
quite low affinity with push-up style, since the call sites into child
nodes are many and scattered.  I got around the low affinity by
turning those nodes into state machines, but I don't think that is
good.
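
To show what I mean by "state machine", here is a generic,
self-contained illustration (not code from that trial):

#include <stdbool.h>

/* A join-like node rewritten as a resumable state machine.  Every
 * place that used to call down into a child and block becomes a phase
 * to resume from, which is why nodes with many scattered call sites
 * (WindowAgg, HashJoin) fit this style poorly. */
typedef enum { BUILDING, PROBING, DONE } Phase;

typedef struct
{
    Phase   phase;
    int     inner_pos;      /* progress through the inner side */
    int     outer_pos;      /* progress through the outer side */
} JoinSketch;

/* Runs one step; returns true when a row was produced. */
static bool
join_step(JoinSketch *node)
{
    switch (node->phase)
    {
        case BUILDING:
            if (++node->inner_pos >= 100)   /* inner side exhausted */
                node->phase = PROBING;
            return false;                   /* no row yet */

        case PROBING:
            if (++node->outer_pos >= 100)
            {
                node->phase = DONE;
                return false;
            }
            return true;                    /* emitted one row */

        case DONE:
        default:
            return false;
    }
}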

> +    /* choose appropriate version of Exec function */
> +    if (appendstate->as_nasyncplans == 0)
> +        appendstate->ps.ExecProcNode = ExecAppend;
> +    else
> +        appendstate->ps.ExecProcNode = ExecAppendAsync;
> 
> Cool.  No extra cost for people not using the new feature.

It creates some duplicated code, but I agree from the performance
perspective.

> +        slot = ExecProcNode(subnode);
> +        if (subnode->asyncstate == AS_AVAILABLE)
> 
> So, now when you execute a node, you get a result AND you get some
> information that you access by reaching into the child node's
> PlanState.  The ExecProcNode() interface is extremely limiting, but
> I'm not sure if this is the right way to extend it.  Maybe
> ExecAsyncProcNode() with a wide enough interface to do the job?

Sounds reasonable and seems easy to do.
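
Something like this, perhaps?  (An illustrative signature only, not a
decided design; AS_AVAILABLE is the patch's value, the rest of the
names are assumed.)

/* Return the tuple and the async status together, so the parent
 * doesn't have to peek into the child's PlanState. */
typedef enum AsyncStatus
{
    AS_AVAILABLE,       /* result holds a tuple */
    AS_WAITING,         /* no tuple yet; wait on the node's event */
    AS_DONE             /* subnode is exhausted */
} AsyncStatus;

typedef struct AsyncResult
{
    TupleTableSlot *result;
    AsyncStatus     status;
} AsyncResult;

extern AsyncResult ExecAsyncProcNode(PlanState *node);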

> +Bitmapset *
> +ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
> +{
> +    static int *refind = NULL;
> +    static int refindsize = 0;
> ...
> +    if (refindsize < n)
> ...
> +            static ExecAsync_mcbarg mcb_arg =
> +                { &refind, &refindsize };
> +            static MemoryContextCallback mcb =
> +                { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
> ...
> +            MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
> 
> This seems a bit strange.  Why not just put the pointer in the plan
> state?  I suppose you want to avoid allocating a new buffer for every
> query.  Perhaps you could fix that by having a small fixed-size buffer
> in the PlanState to cover common cases and allocating a larger one in
> a per-query memory context if that one is too small, or just not
> worrying about it and allocating every time since you know the desired
> size.

The most significant factor behind that shape is that ExecAsync is not
a kind of ExecNode, so ExecAsyncEventWait doesn't have direct access
to an EState other than through one of the given nodes.  I am
considering trying to use the given ExecNodes as an access path to the
EState.
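
Since every PlanState points back at its EState, it could look roughly
like this (PlanState->state and EState->es_query_cxt are core fields;
the body is only a sketch of the patch's function):

Bitmapset *
ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
{
    /* Any of the given nodes reaches the same EState. */
    EState     *estate = nodes[0]->state;
    int         n = bms_num_members(waitnodes);
    int        *refind;

    /* Allocate in the per-query context instead of a static buffer,
     * so no MemoryContextCallback is needed for cleanup. */
    refind = (int *) MemoryContextAlloc(estate->es_query_cxt,
                                        n * sizeof(int));

    /* ... build the wait event set, wait, map results back ... */

    return NULL;        /* placeholder for the sketch */
}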


> +    wes = CreateWaitEventSet(TopTransactionContext,
> TopTransactionResourceOwner, n);
> ...
> +    FreeWaitEventSet(wes);
> 
> BTW, just as an FYI, I am proposing[1] to add support for
> RemoveWaitEvent(), so that you could have a single WaitEventSet for
> the lifetime of the executor node, and then add and remove sockets
> only as needed.  I'm hoping to commit that for PG13, if there are no
> objections or better ideas soon, because it's useful for some other
> places where we currently create and destroy WaitEventSets frequently.

Yes! I have wanted that (but haven't done it myself..., and I didn't
understand the details from the title "Reducing WaitEventSet syscall
churn" :p)

> One complication when working with long-lived WaitEventSet objects is
> that libpq (or some other thing used by some other hypothetical
> async-capable FDW) is free to close and reopen its socket whenever it
> wants, so you need a way to know when it has done that.  In that patch
> set I added pqSocketChangeCount() so that you can see when pgSocket()
> refers to a new socket (even if the file descriptor number is the same
> by coincidence), but that imposes some book-keeping duties on the
> caller, and now I'm wondering how that would look in your patch set.

As for postgres_fdw, an unexpected disconnection immediately leads to
a query ERROR.
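
(If a long-lived set did try to survive reconnection, the bookkeeping
might look like the sketch below.  Note that comparing PQsocket()
alone misses the case you mention where the new socket reuses the same
fd number, which is what pqSocketChangeCount() would address.  The
struct and names here are invented for illustration.)

typedef struct ConnWaitState
{
    pgsocket    last_sock;      /* fd currently registered in the set */
    int         wes_pos;        /* its position in the WaitEventSet */
} ConnWaitState;

static void
ensure_socket_registered(WaitEventSet *wes, PGconn *conn,
                         ConnWaitState *st)
{
    pgsocket    sock = PQsocket(conn);

    if (sock != st->last_sock)
    {
        /* libpq closed and reopened its connection: fix up the set.
         * (RemoveWaitEvent is the proposed API, not yet in core.) */
        RemoveWaitEvent(wes, st->wes_pos);
        st->wes_pos = AddWaitEventToSet(wes, WL_SOCKET_READABLE,
                                        sock, NULL, NULL);
        st->last_sock = sock;
    }
}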

> My goal is to generate the minimum number of systems calls.  I think
> it would be nice if a 1000-shard query only calls epoll_ctl() when a
> child node needs to be added or removed from the set, not
> epoll_create(), 1000 * epoll_ctl(), epoll_wait(), close()  for every
> wait.  But I suppose there is an argument that it's more complication
> than it's worth.
> 
> [1] https://commitfest.postgresql.org/27/2452/

I'm not sure how much performance gain it gives, but reducing syscalls
is good in itself.  I'll look into it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


