Re: Asynchronous Append on postgres_fdw nodes. - Mailing list pgsql-hackers
From:           Kyotaro Horiguchi
Subject:        Re: Asynchronous Append on postgres_fdw nodes.
Date:
Msg-id:         20200305.174424.937266824649246262.horikyota.ntt@gmail.com
In response to: Re: Asynchronous Append on postgres_fdw nodes. (Thomas Munro <thomas.munro@gmail.com>)
Responses:      Re: Asynchronous Append on postgres_fdw nodes.
List:           pgsql-hackers
Thank you for the comment.

At Thu, 5 Mar 2020 16:17:24 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in
> On Fri, Feb 28, 2020 at 9:08 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > - v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
> >   The async feature uses WaitEvent, and it needs to be released on
> >   error. This patch makes it possible to register a WaitEvent to a
> >   resowner to handle that case.
>
> +1
>
> > - v2-0002-infrastructure-for-asynchronous-execution.patch
> >   It provides an abstraction layer for asynchronous behavior
> >   (execAsync), then adds ExecAppendAsync, another version of
> >   ExecAppend, that handles "async-capable" subnodes asynchronously.
> >   It also contains the planner part that makes the planner aware of
> >   "async-capable" and "async-aware" path nodes.
> >
> >   This patch adds an infrastructure for asynchronous execution. As a
> >   PoC this makes only Append capable of handling asynchronously
> >   executable subnodes.
>
> What other nodes do you think could be async aware? I suppose you
> would teach joins to pass through the async support of their children,
> and then you could make partition-wise join work like that.

An Append node is fed by many immediate-child async-capable nodes, so
the Append node can pick any child node that has fired. Unfortunately,
joins are not wide but deep. That means a set of async-capable nodes
has multiple async-aware parent nodes (which are also async-capable,
in the case of intermediate nodes). So if we want to be async in that
configuration, we need a "push-up" executor engine. In my last trial,
ignoring performance, I could turn almost all nodes into push-up style
except a few; WindowAgg and HashJoin have quite low affinity with the
push-up style since the call sites into child nodes are many and
scattered. I got past the low affinity by turning those nodes into
state machines, but I don't think that is good.

> +	/* choose appropriate version of Exec function */
> +	if (appendstate->as_nasyncplans == 0)
> +		appendstate->ps.ExecProcNode = ExecAppend;
> +	else
> +		appendstate->ps.ExecProcNode = ExecAppendAsync;
>
> Cool. No extra cost for people not using the new feature.

It creates some duplicate code, but I agree from the performance
perspective.

> +	slot = ExecProcNode(subnode);
> +	if (subnode->asyncstate == AS_AVAILABLE)
>
> So, now when you execute a node, you get a result AND you get some
> information that you access by reaching into the child node's
> PlanState. The ExecProcNode() interface is extremely limiting, but
> I'm not sure if this is the right way to extend it. Maybe
> ExecAsyncProcNode() with a wide enough interface to do the job?

Sounds reasonable and seems easy to do.
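To make that concrete, a widened entry point might look roughly like
the sketch below. This is an illustration only: AS_AVAILABLE and the
asyncstate field are from the posted patch, while the shape of
ExecAsyncProcNode() and the AsyncResult name are hypothetical, not
taken from the patches.

#include "postgres.h"
#include "executor/executor.h"
#include "nodes/execnodes.h"

/*
 * A widened async entry point that returns both the tuple and the
 * subnode's readiness in a single call, so the parent no longer has to
 * reach into the child's PlanState after calling ExecProcNode().
 */
typedef struct AsyncResult
{
	TupleTableSlot *slot;	/* the usual ExecProcNode() result */
	bool		ready;		/* asyncstate == AS_AVAILABLE? */
} AsyncResult;

static inline AsyncResult
ExecAsyncProcNode(PlanState *subnode)
{
	AsyncResult res;

	res.slot = ExecProcNode(subnode);
	/* asyncstate is the PlanState field added by the posted patch */
	res.ready = (subnode->asyncstate == AS_AVAILABLE);

	return res;
}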
> +Bitmapset *
> +ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
> +{
> +	static int *refind = NULL;
> +	static int refindsize = 0;
> ...
> +	if (refindsize < n)
> ...
> +	static ExecAsync_mcbarg mcb_arg =
> +		{ &refind, &refindsize };
> +	static MemoryContextCallback mcb =
> +		{ ExecAsyncMemoryContextCallback, (void *) &mcb_arg, NULL };
> ...
> +	MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
>
> This seems a bit strange. Why not just put the pointer in the plan
> state? I suppose you want to avoid allocating a new buffer for every
> query. Perhaps you could fix that by having a small fixed-size buffer
> in the PlanState to cover common cases and allocating a larger one in
> a per-query memory context if that one is too small, or just not
> worrying about it and allocating every time since you know the desired
> size.

The most significant factor behind that shape is that ExecAsync is not
a kind of ExecNode, so ExecAsyncEventWait doesn't have direct access to
the EState other than through one of the given nodes. I'm considering
trying to use the given ExecNodes as an access path to the EState.

> +	wes = CreateWaitEventSet(TopTransactionContext,
> 							 TopTransactionResourceOwner, n);
> ...
> +	FreeWaitEventSet(wes);
>
> BTW, just as an FYI, I am proposing[1] to add support for
> RemoveWaitEvent(), so that you could have a single WaitEventSet for
> the lifetime of the executor node, and then add and remove sockets
> only as needed. I'm hoping to commit that for PG13, if there are no
> objections or better ideas soon, because it's useful for some other
> places where we currently create and destroy WaitEventSets frequently.

Yes! I have wanted that (but haven't done it myself..., and I didn't
understand the details from the title "Reducing WaitEventSet syscall
churn" :p).

> One complication when working with long-lived WaitEventSet objects is
> that libpq (or some other thing used by some other hypothetical
> async-capable FDW) is free to close and reopen its socket whenever it
> wants, so you need a way to know when it has done that. In that patch
> set I added pqSocketChangeCount() so that you can see when pgSocket()
> refers to a new socket (even if the file descriptor number is the same
> by coincidence), but that imposes some book-keeping duties on the
> caller, and now I'm wondering how that would look in your patch set.

As for postgres_fdw, an unexpected disconnection immediately leads to a
query ERROR.

> My goal is to generate the minimum number of system calls. I think
> it would be nice if a 1000-shard query only calls epoll_ctl() when a
> child node needs to be added to or removed from the set, not
> epoll_create(), 1000 * epoll_ctl(), epoll_wait(), close() for every
> wait. But I suppose there is an argument that it's more complication
> than it's worth.
>
> [1] https://commitfest.postgresql.org/27/2452/

I'm not sure how much performance gain it gives, but reducing syscalls
is good in itself. I'll look into it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
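For reference, the single long-lived WaitEventSet pattern discussed
above might look roughly like the following sketch. RemoveWaitEvent()
is from the patch set proposed at [1] and is not in core as of this
message; the function and variable names here are illustrative, and in
real code the set would live in the node's state rather than a static.

#include "postgres.h"
#include "pgstat.h"
#include "storage/latch.h"
#include "utils/memutils.h"

/* One set reused across calls instead of one built per wait. */
static WaitEventSet *wes = NULL;

static void
wait_for_subnode_sockets(pgsocket *socks, void **subnodes, int n,
						 long timeout)
{
	WaitEvent	occurred;

	if (wes == NULL)
	{
		int			i;

		/* Build the set once; each event carries its subnode pointer. */
		wes = CreateWaitEventSet(TopMemoryContext, n);
		for (i = 0; i < n; i++)
			AddWaitEventToSet(wes, WL_SOCKET_READABLE, socks[i],
							  NULL, subnodes[i]);
	}

	/* One epoll_wait()-equivalent per call; no per-call set churn. */
	(void) WaitEventSetWait(wes, timeout, &occurred, 1, PG_WAIT_EXTENSION);

	/*
	 * With the proposed RemoveWaitEvent(), a socket that libpq has
	 * closed and reopened could be dropped and re-added here instead of
	 * rebuilding the whole set with FreeWaitEventSet().
	 */
}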