Thread: Re: asynchronous execution
[ Adjusting subject line to reflect the actual topic of discussion better. ]

On Fri, Sep 23, 2016 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> For e.g., in the above plan which you specified, suppose:
>> 1. Hash Join has called ExecProcNode() for the child foreign scan b, and so
>> is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
>> 2. The event wait list already has foreign scan on a that is on a different
>> subtree.
>> 3. This foreign scan a happens to be ready, so in
>> ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is called,
>> which returns with result_ready.
>> 4. Since it returns result_ready, its parent node is now inserted in the
>> callbacks array, and so its parent (Append) is executed.
>> 5. But, this Append planstate is already in the middle of executing Hash
>> join, and is waiting for HashJoin.
>
> Ah, yeah, something like that could happen. I've spent much of this
> week working on a new design for this feature which I think will avoid
> this problem. It doesn't work yet - in fact I can't even really test
> it yet. But I'll post what I've got by the end of the day today so
> that anyone who is interested can look at it and critique.

Well, I promised to post this, so here it is. It's not really working all that well at this point, and it's definitely not doing anything that interesting, but you can see the outline of what I have in mind. Since Kyotaro Horiguchi found that my previous design had a system-wide performance impact due to the ExecProcNode changes, I decided to take a different approach here: I created an async infrastructure where both the requestor and the requestee have to be specifically modified to support parallelism, and then modified Append and ForeignScan to cooperate using the new interface. Hopefully that means that anything other than those two nodes will suffer no performance impact. Of course, it might have other problems....

Some notes:

- EvalPlanQual rechecks are broken.
- EXPLAIN ANALYZE instrumentation is broken.
- ExecReScanAppend is broken, because the async stuff needs some way of canceling an async request and I didn't invent anything like that yet.
- The postgres_fdw changes pretend to be async but aren't actually. It's just a demo of (part of) the interface at this point.
- The postgres_fdw changes also report all pg-fdw paths as async-capable, but actually the direct-modify ones aren't, so the regression tests fail.
- Errors in the executor can leak the WaitEventSet. Probably we need to modify ResourceOwners to be able to own WaitEventSets.
- There are probably other bugs, too.

Whee!

Note that I've tried to solve the re-entrancy problems by (1) putting all of the event loop's state inside the EState rather than in local variables and (2) having the function that is called to report arrival of a result be thoroughly different from the function that is used to return a tuple to a synchronous caller.

Comments welcome, if you're feeling brave enough to look at anything this half-baked.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
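[ For orientation, a rough sketch of the interface shape described above: both sides must opt in, the event-loop state hangs off the EState, and results are reported through a callback rather than through ExecProcNode. Names such as PendingAsyncRequest, callback_pending, and request_complete appear later in the thread; the exact layout below is illustrative, not the posted patch. ]

typedef struct PendingAsyncRequest
{
    int         request_index;      /* which child of the requestor */
    PlanState  *requestor;          /* async-aware parent, e.g. AppendState */
    PlanState  *requestee;          /* async-capable child, e.g. ForeignScanState */
    bool        callback_pending;   /* response callback still to run? */
    bool        request_complete;   /* result (or EOF) available? */
    TupleTableSlot *result;         /* set by the requestee */
} PendingAsyncRequest;

/* Requestor side: queue a request.  All loop state hangs off the EState
 * (point 1 of the re-entrancy fix), not off local variables. */
extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
                             int request_index, PlanState *requestee);

/* Requestee side: deliver a result via a callback that is deliberately
 * separate from the synchronous tuple-return path (point 2). */
extern void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);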
On 24 September 2016 at 06:39, Robert Haas <robertmhaas@gmail.com> wrote:
> Since Kyotaro Horiguchi found that my previous design had a
> system-wide performance impact due to the ExecProcNode changes, I
> decided to take a different approach here: I created an async
> infrastructure where both the requestor and the requestee have to be
> specifically modified to support parallelism, and then modified Append
> and ForeignScan to cooperate using the new interface. Hopefully that
> means that anything other than those two nodes will suffer no
> performance impact. Of course, it might have other problems....

I see that the reason why you re-designed the asynchronous execution implementation is that the earlier implementation showed performance degradation in local sequential and local parallel scans. But I checked that the ExecProcNode() changes were not so significant as to cause the degradation. It will not call ExecAsyncWaitForNode() unless that node supports asynchronism. Do you feel there is anywhere else in the implementation that is really causing this degradation? That previous implementation has some issues, but they seemed solvable. We could resolve the plan state recursion issue by explicitly making sure the same plan state does not get called again while it is already executing.

Thanks
-Amit Khandekar
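[ One minimal way to implement the guard Amit suggests, sketched against the earlier design. ExecDispatchNode is from that framework; the ps_executing field and the wrapper are hypothetical, not something in the posted patches. ]

/* Hypothetical re-entrancy guard: refuse to re-enter a plan state that
 * is already executing further up the call stack. */
static bool
ExecDispatchNodeGuarded(PlanState *node, TupleTableSlot **result)
{
    if (node->ps_executing)     /* hypothetical new PlanState field */
        return false;           /* already active; caller keeps waiting */

    node->ps_executing = true;
    *result = ExecDispatchNode(node);   /* the earlier framework's entry */
    node->ps_executing = false;
    return true;
}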
Sorry for delayed response, I'll have enough time from now and address this. At Fri, 23 Sep 2016 21:09:03 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoaXQEt4tZ03FtQhnzeDEMzBck+Lrni0UWHVVgOTnA6C1w@mail.gmail.com> > Well, I promised to post this, so here it is. It's not really working > all that well at this point, and it's definitely not doing anything > that interesting, but you can see the outline of what I have in mind. > Since Kyotaro Horiguchi found that my previous design had a > system-wide performance impact due to the ExecProcNode changes, I > decided to take a different approach here: I created an async > infrastructure where both the requestor and the requestee have to be > specifically modified to support parallelism, and then modified Append > and ForeignScan to cooperate using the new interface. Hopefully that > means that anything other than those two nodes will suffer no > performance impact. Of course, it might have other problems.... > > Some notes: > > - EvalPlanQual rechecks are broken. > - EXPLAIN ANALYZE instrumentation is broken. > - ExecReScanAppend is broken, because the async stuff needs some way > of canceling an async request and I didn't invent anything like that > yet. > - The postgres_fdw changes pretend to be async but aren't actually. > It's just a demo of (part of) the interface at this point. > - The postgres_fdw changes also report all pg-fdw paths as > async-capable, but actually the direct-modify ones aren't, so the > regression tests fail. > - Errors in the executor can leak the WaitEventSet. Probably we need > to modify ResourceOwners to be able to own WaitEventSets. > - There are probably other bugs, too. > > Whee! > > Note that I've tried to solve the re-entrancy problems by (1) putting > all of the event loop's state inside the EState rather than in local > variables and (2) having the function that is called to report arrival > of a result be thoroughly different than the function that is used to > return a tuple to a synchronous caller. > > Comments welcome, if you're feeling brave enough to look at anything > this half-baked. -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, thank you for the comment.

At Wed, 28 Sep 2016 10:00:08 +0530, Amit Khandekar <amitdkhan.pg@gmail.com> wrote in <CAJ3gD9fRmEhUoBMnNN8K_QrHZf7m4rmOHTFDj492oeLZff8o=w@mail.gmail.com>
> On 24 September 2016 at 06:39, Robert Haas <robertmhaas@gmail.com> wrote:
> > Since Kyotaro Horiguchi found that my previous design had a
> > system-wide performance impact due to the ExecProcNode changes, I
> > decided to take a different approach here: I created an async
> > infrastructure where both the requestor and the requestee have to be
> > specifically modified to support parallelism, and then modified Append
> > and ForeignScan to cooperate using the new interface. Hopefully that
> > means that anything other than those two nodes will suffer no
> > performance impact. Of course, it might have other problems....
>
> I see that the reason why you re-designed the asynchronous execution
> implementation is because the earlier implementation showed
> performance degradation in local sequential and local parallel scans.
> But I checked that the ExecProcNode() changes were not that
> significant as to cause the degradation.

The basic thought is that we can't accept a degradation even as small as around one percent for simple cases in exchange for this feature (or similar ones). A very simple SeqScan runs through a very short path, where the CPU's branch-misprediction penalty from even a few extra branches results in a visible impact. I avoided that by using likely/unlikely, but a more fundamental measure is preferable.

> It will not call ExecAsyncWaitForNode() unless that node
> supports asynchronism.

That's true, but it takes a certain number of CPU cycles to decide whether to call it or not. That small bit of time is the issue in focus now.

> Do you feel there is anywhere else in
> the implementation that is really causing this degrade ? That
> previous implementation has some issues, but they seemed
> solvable. We could resolve the plan state recursion issue by
> explicitly making sure the same plan state does not get called
> again while it is already executing.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
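[ For readers unfamiliar with the likely/unlikely trick mentioned here: the macros wrap GCC's __builtin_expect so the compiler lays out the rarely-taken branch off the hot path. A self-contained toy model of the situation; the struct and helper names are made up for illustration. ]

#include <stdbool.h>

#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)

struct toy_node { bool async_pending; };

extern long wait_for_async(struct toy_node *n);
extern long run_sync(struct toy_node *n);

/* Called millions of times per query; one extra, badly predicted
 * branch here is enough to show up as the ~1% regression above. */
long
exec_node(struct toy_node *n)
{
    if (unlikely(n->async_pending))   /* almost never true */
        return wait_for_async(n);
    return run_sync(n);               /* straight-line common case */
}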
Thank you for the thought. At Fri, 23 Sep 2016 21:09:03 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoaXQEt4tZ03FtQhnzeDEMzBck+Lrni0UWHVVgOTnA6C1w@mail.gmail.com> > [ Adjusting subject line to reflect the actual topic of discussion better. ] > > On Fri, Sep 23, 2016 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > >> For e.g., in the above plan which you specified, suppose : > >> 1. Hash Join has called ExecProcNode() for the child foreign scan b, and so > >> is > >> waiting in ExecAsyncWaitForNode(foreign_scan_on_b). > >> 2. The event wait list already has foreign scan on a that is on a different > >> subtree. > >> 3. This foreign scan a happens to be ready, so in > >> ExecAsyncWaitForNode (), ExecDispatchNode(foreign_scan_a) is called, > >> which returns with result_ready. > >> 4. Since it returns result_ready, it's parent node is now inserted in the > >> callbacks array, and so it's parent (Append) is executed. > >> 5. But, this Append planstate is already in the middle of executing Hash > >> join, and is waiting for HashJoin. > > > > Ah, yeah, something like that could happen. I've spent much of this > > week working on a new design for this feature which I think will avoid > > this problem. It doesn't work yet - in fact I can't even really test > > it yet. But I'll post what I've got by the end of the day today so > > that anyone who is interested can look at it and critique. > > Well, I promised to post this, so here it is. It's not really working > all that well at this point, and it's definitely not doing anything > that interesting, but you can see the outline of what I have in mind. > Since Kyotaro Horiguchi found that my previous design had a > system-wide performance impact due to the ExecProcNode changes, I > decided to take a different approach here: I created an async > infrastructure where both the requestor and the requestee have to be > specifically modified to support parallelism, and then modified Append > and ForeignScan to cooperate using the new interface. Hopefully that > means that anything other than those two nodes will suffer no > performance impact. Of course, it might have other problems.... The previous framework didn't need to distinguish async-capable and incapable nodes from the parent node's view; the additions to ExecProcNode were required for that reason. Instead, this new one removes the ExecProcNode stuff by distinguishing the two kinds of node in async-aware parents, that is, Append. This no longer involves async-unaware nodes in the tuple bubbling-up mechanism, so the reentrancy problem doesn't seem to occur. On the other hand, consider for example the following plan, regardless of its practicality (there should be a better example..):

(Async-unaware node)
  -> NestLoop
       -> Append
            -> n * ForeignScan
       -> Append
            -> n * ForeignScan

If the NestLoop and Appends are async-aware, all of the ForeignScans can run asynchronously with the previous framework. The topmost NestLoop will be awakened as soon as the firing of any ForeignScan makes a tuple bubble up to it. This comes from the no-need-to-distinguish-aware-or-not nature provided by the ExecProcNode stuff. On the other hand, with the new one, in order to do the same thing, ExecAppend would in turn have to behave differently depending on whether its parent is async-aware or not. Doing this looks bothersome, but I am not confident about that. I will examine this further intensively, especially for performance degradation and obstacles to completing this.
> Some notes: > > - EvalPlanQual rechecks are broken. > - EXPLAIN ANALYZE instrumentation is broken. > - ExecReScanAppend is broken, because the async stuff needs some way > of canceling an async request and I didn't invent anything like that > yet. > - The postgres_fdw changes pretend to be async but aren't actually. > It's just a demo of (part of) the interface at this point. > - The postgres_fdw changes also report all pg-fdw paths as > async-capable, but actually the direct-modify ones aren't, so the > regression tests fail. > - Errors in the executor can leak the WaitEventSet. Probably we need > to modify ResourceOwners to be able to own WaitEventSets. > - There are probably other bugs, too. > > Whee! > > Note that I've tried to solve the re-entrancy problems by (1) putting > all of the event loop's state inside the EState rather than in local > variables and (2) having the function that is called to report arrival > of a result be thoroughly different than the function that is used to > return a tuple to a synchronous caller. > > Comments welcome, if you're feeling brave enough to look at anything > this half-baked. -- Kyotaro Horiguchi NTT Open Source Software Center
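[ For orientation, the Append-side control flow under the new framework looks roughly like the following. Field names such as as_needrequest, as_asyncresult, and as_nasyncpending come from the patches discussed later in this thread; the flow is a simplified paraphrase, and exec_append_next_sync_plan() stands in for the real subplan-advance logic. ]

TupleTableSlot *
ExecAppend(AppendState *node)
{
    for (;;)
    {
        /* Fire a request at every async child that wants one. */
        while (!bms_is_empty(node->as_needrequest))
        {
            int     i = bms_first_member(node->as_needrequest);

            ExecAsyncRequest(node->ps.state, &node->ps, i,
                             node->appendplans[i]);
        }

        /* Hand back a tuple an async child has already delivered. */
        if (node->as_nasyncresult > 0)
            return node->as_asyncresult[--node->as_nasyncresult];

        /* Otherwise run the current synchronous child, if any. */
        if (node->as_whichplan >= 0)
        {
            TupleTableSlot *slot =
                ExecProcNode(node->appendplans[node->as_whichplan]);

            if (!TupIsNull(slot))
                return slot;
            node->as_whichplan = exec_append_next_sync_plan(node);
            continue;
        }

        /* Nothing ready: done, or block until an async child responds. */
        if (node->as_nasyncpending == 0)
            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
        ExecAsyncEventLoop(node->ps.state, &node->ps, -1);
    }
}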
On Wed, Sep 28, 2016 at 12:30 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 24 September 2016 at 06:39, Robert Haas <robertmhaas@gmail.com> wrote:
>> Since Kyotaro Horiguchi found that my previous design had a
>> system-wide performance impact due to the ExecProcNode changes, I
>> decided to take a different approach here: I created an async
>> infrastructure where both the requestor and the requestee have to be
>> specifically modified to support parallelism, and then modified Append
>> and ForeignScan to cooperate using the new interface. Hopefully that
>> means that anything other than those two nodes will suffer no
>> performance impact. Of course, it might have other problems....
>
> I see that the reason why you re-designed the asynchronous execution
> implementation is because the earlier implementation showed
> performance degradation in local sequential and local parallel scans.
> But I checked that the ExecProcNode() changes were not that
> significant as to cause the degradation.

I think we need some testing to prove that one way or the other. If you can do some - say on a plan with multiple nested loop joins with inner index-scans, which will call ExecProcNode() a lot - that would be great. I don't think we can just rely on "it doesn't seem like it should be slower", though - ExecProcNode() is too important a function for us to guess at what the performance will be.

The thing I'm really worried about with either implementation is what happens when we start to add asynchronous capability to multiple nodes. For example, if you imagine a plan like this:

Append
-> Hash Join
   -> Foreign Scan
   -> Hash
      -> Seq Scan
-> Hash Join
   -> Foreign Scan
   -> Hash
      -> Seq Scan

In order for this to run asynchronously, you need not only Append and Foreign Scan to be async-capable, but also Hash Join. That's true in either approach. Things are slightly better with the original approach, but the basic problem is there in both cases. So it seems we need an approach that will make adding async capability to a node really cheap, which seems like it might be a problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 4 October 2016 at 02:30, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Sep 28, 2016 at 12:30 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> On 24 September 2016 at 06:39, Robert Haas <robertmhaas@gmail.com> wrote: >>> Since Kyotaro Horiguchi found that my previous design had a >>> system-wide performance impact due to the ExecProcNode changes, I >>> decided to take a different approach here: I created an async >>> infrastructure where both the requestor and the requestee have to be >>> specifically modified to support parallelism, and then modified Append >>> and ForeignScan to cooperate using the new interface. Hopefully that >>> means that anything other than those two nodes will suffer no >>> performance impact. Of course, it might have other problems.... >> >> I see that the reason why you re-designed the asynchronous execution >> implementation is because the earlier implementation showed >> performance degradation in local sequential and local parallel scans. >> But I checked that the ExecProcNode() changes were not that >> significant as to cause the degradation. > > I think we need some testing to prove that one way or the other. If > you can do some - say on a plan with multiple nested loop joins with > inner index-scans, which will call ExecProcNode() a lot - that would > be great. I don't think we can just rely on "it doesn't seem like it > should be slower" Agreed. I will come up with some tests. > , though - ExecProcNode() is too important a function > for us to guess at what the performance will be. Also, parent pointers are not required in the new design. Thinking of parent pointers, now it seems the event won't get bubbled up the tree with the new design. But still, I think it's possible to switch over to the other asynchronous tree when some node in the current subtree is waiting. But I am not sure, will think more on that. > > The thing I'm really worried about with either implementation is what > happens when we start to add asynchronous capability to multiple > nodes. For example, if you imagine a plan like this: > > Append > -> Hash Join > -> Foreign Scan > -> Hash > -> Seq Scan > -> Hash Join > -> Foreign Scan > -> Hash > -> Seq Scan > > In order for this to run asynchronously, you need not only Append and > Foreign Scan to be async-capable, but also Hash Join. That's true in > either approach. Things are slightly better with the original > approach, but the basic problem is there in both cases. So it seems > we need an approach that will make adding async capability to a node > really cheap, which seems like it might be a problem. Yes, we might have to deal with this. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
On Tue, Oct 4, 2016 at 7:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Also, parent pointers are not required in the new design. Thinking of > parent pointers, now it seems the event won't get bubbled up the tree > with the new design. But still, , I think it's possible to switch over > to the other asynchronous tree when some node in the current subtree > is waiting. But I am not sure, will think more on that. The bubbling-up still happens, because each node that made an async request gets a callback with the final response - and if it is itself the recipient of an async request, it can use that callback to respond to that request in turn. This version doesn't bubble up through non-async-aware nodes, but that might be a good thing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
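[ In other words, the bubbling is driven by per-node response callbacks rather than by returning through ExecProcNode. The dispatcher might look roughly like this; ExecAsyncAppendResponse is named later in the thread, but the switch shape is a guess. ]

void
ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
{
    /* Dispatch on the requestor's node type; an async-aware requestor
     * that is itself a requestee can complete its own request here,
     * which is how a result climbs an all-async-aware subtree. */
    switch (nodeTag(areq->requestor))
    {
        case T_AppendState:
            ExecAsyncAppendResponse(estate, areq);
            break;
        default:
            elog(ERROR, "unrecognized async requestor type: %d",
                 (int) nodeTag(areq->requestor));
    }
}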
Hello, this works, but ExecAppend gets a bit of degradation.

At Mon, 03 Oct 2016 19:46:32 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161003.194632.204401048.horiguchi.kyotaro@lab.ntt.co.jp>
> > Some notes:
> >
> > - EvalPlanQual rechecks are broken.

This is fixed by adding (restoring) async-cancelation.

> > - EXPLAIN ANALYZE instrumentation is broken.

EXPLAIN ANALYZE seems to be working, but async-specific information is not available yet.

> > - ExecReScanAppend is broken, because the async stuff needs some way
> > of canceling an async request and I didn't invent anything like that
> > yet.

Fixed in the same way as EvalPlanQual.

> > - The postgres_fdw changes pretend to be async but aren't actually.
> > It's just a demo of (part of) the interface at this point.

Applied my previous patch with some modification.

> > - The postgres_fdw changes also report all pg-fdw paths as
> > async-capable, but actually the direct-modify ones aren't, so the
> > regression tests fail.

All actions other than scans do vacate_connection() before using a connection.

> > - Errors in the executor can leak the WaitEventSet. Probably we need
> > to modify ResourceOwners to be able to own WaitEventSets.

The WaitEventSet itself is not leaked, but the epoll fd should be closed on failure. This seems doable by TRY-CATCHing in ExecAsyncEventLoop. (Not done yet.)

> > - There are probably other bugs, too.
> >
> > Whee!
> >
> > Note that I've tried to solve the re-entrancy problems by (1) putting
> > all of the event loop's state inside the EState rather than in local
> > variables and (2) having the function that is called to report arrival
> > of a result be thoroughly different than the function that is used to
> > return a tuple to a synchronous caller.
> >
> > Comments welcome, if you're feeling brave enough to look at anything
> > this half-baked.

This doesn't cause reentry, since it no longer bubbles tuples up through async-unaware nodes. This framework passes tuples through private channels between requestors and requestees.

Anyway, I amended this and made postgres_fdw async, and then finally all regression tests passed with minor modifications.

The attached patches are the following:

0001-robert-s-2nd-framework.patch
  The patch Robert showed upthread.

0002-Fix-some-bugs.patch
  A small patch to fix compilation errors in 0001.

0003-Modify-async-execution-infrastructure.patch
  Several modifications on the infrastructure. The details are shown after the measurement below.

0004-Make-postgres_fdw-async-capable.patch
  True-async postgres_fdw.

gentblr.sql, testrun.sh, calc.pl
  Performance test script suite. gentblr.sql creates the test tables, testrun.sh does a single test run, and calc.pl drives testrun.sh and summarizes its results.

I measured performance and had the following result.

 t0 - SELECT sum(a) FROM <local single table>;
 pl - SELECT sum(a) FROM <4 local children>;
pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;

The result is written as "time<ms> (std dev <ms>)".

sync
  t0: 3820.33 (  1.88)
  pl: 1608.59 ( 12.06)
 pf0: 7928.29 ( 46.58)
 pf1: 8023.16 ( 26.43)

async
  t0: 3806.31 (  4.49)   0.4% faster (should be error)
  pl: 1629.17 (  0.29)   1.3% slower
 pf0: 6447.07 ( 25.19)  18.7% faster
 pf1: 1876.80 ( 47.13)  76.6% faster

t0 is not affected since the ExecProcNode stuff has gone. pl is getting a bit slower (almost the same as the simple seqscan of the previous patch); this should be a misprediction penalty. pf0 and pf1 are faster as expected.
========
The below is a summary of the modifications made by the 0002 and 0003 patches.

execAsync.c, execnodes.h:

- Added #include "pgstat.h" to use WAIT_EVENT_ASYNC_WAIT.

- Changed the interface of ExecAsyncRequest to return whether a tuple is immediately available or not.

- Made ExecAsyncConfigureWait return whether it registered at least one wait event or not. This is used so that the caller (ExecAsyncEventWait) knows it has an event to wait for (for safety). If two or more postgres_fdw nodes are sharing one connection, only one of them can be waited on at once. It is the responsibility of the FDW driver to ensure that at least one wait event is added; if none is, WaitEventSetWait silently waits forever.

- There were separate areq->callback_pending and areq->request_complete flags, but they always change together, so they are replaced with one state variable, areq->state. A new enum AsyncRequestState for areq->state is added in execnodes.h.

nodeAppend.c:

- Return a tuple immediately if ExecAsyncRequest says that a tuple is available.

- Reduced the nesting level of for(;;).

nodeForeignscan.[ch], fdwapi.h, execProcnode.c:

- Calling postgresIterateForeignScan directly can yield tuples in the wrong shape. Call ExecForeignScan instead.

- Changed the interface of AsyncConfigureWait as in execAsync.c.

- Added the ShutdownForeignScan interface.

createplan.c, ruleutils.c, plannodes.h:

- With Robert's change, EXPLAIN shows somewhat odd plans where the Output of Append is named after a child that is not the referent parent. This does no harm but is uneasy to look at. Added the index of the parent in Append.referent to make it reasonable. (But this looks ugly..) Still, children in EXPLAIN appear in a different order from the definition. (expected/postgres_fdw.out is edited.)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
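[ Putting the interface changes above together, the touched declarations presumably look something like this. ASYNC_WAITING, ASYNC_CALLBACK_PENDING, and ASYNC_COMPLETE are referenced later in the thread; ASYNC_IDLE and the exact parameter lists are reconstructions. ]

typedef enum AsyncRequestState
{
    ASYNC_IDLE,                 /* no request outstanding */
    ASYNC_WAITING,              /* waiting for a wait event to fire */
    ASYNC_CALLBACK_PENDING,     /* fired; response callback not yet run */
    ASYNC_COMPLETE              /* tuple or EOF available */
} AsyncRequestState;

/* Now returns whether a tuple is immediately available. */
extern bool ExecAsyncRequest(EState *estate, PlanState *requestor,
                             int request_index, PlanState *requestee);

/* Now returns whether at least one wait event was registered, so
 * ExecAsyncEventWait can refuse to sleep on an empty set. */
extern bool ExecAsyncConfigureWait(EState *estate,
                                   PendingAsyncRequest *areq,
                                   bool reinit);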
This is the rebased version on the current master (-0004), plus the resowner stuff (0005) and unlikely() (0006).

At Tue, 18 Oct 2016 10:30:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161018.103051.30820907.horiguchi.kyotaro@lab.ntt.co.jp>
> > > - Errors in the executor can leak the WaitEventSet. Probably we need
> > > to modify ResourceOwners to be able to own WaitEventSets.
>
> WaitEventSet itself is not leaked but epoll-fd should be closed
> at failure. This seems doable with TRY-CATCHing in
> ExecAsyncEventLoop. (not yet)

Haha, that was silly talk. The wait event set can continue to live after a timeout, and any error can happen on the way after that. I added an entry for wait event sets to the resource owner mechanism, and hung the ones created in ExecAsyncEventWait on TopTransactionResourceOwner. Currently WaitLatchOrSocket doesn't do so, in order not to change the current behavior. WaitEventSet doesn't have a usable identifier for resowner.c, so currently I use its address (pointer value) for the purpose. Patch 0005 does that.

> I measured performance and had the following result.
>
> t0 - SELECT sum(a) FROM <local single table>;
> pl - SELECT sum(a) FROM <4 local children>;
> pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
> pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
>
> The result is written as "time<ms> (std dev <ms>)"
>
> sync
> t0: 3820.33 ( 1.88)
> pl: 1608.59 ( 12.06)
> pf0: 7928.29 ( 46.58)
> pf1: 8023.16 ( 26.43)
>
> async
> t0: 3806.31 ( 4.49) 0.4% faster (should be error)
> pl: 1629.17 ( 0.29) 1.3% slower
> pf0: 6447.07 ( 25.19) 18.7% faster
> pf1: 1876.80 ( 47.13) 76.6% faster
>
> t0 is not affected since the ExecProcNode stuff has gone.
>
> pl is getting a bit slower. (almost the same to simple seqscan of
> the previous patch) This should be a misprediction penalty.

Using the likely() macro in ExecAppend seems to have shaken off the degradation:

sync
  t0: 3919.49 (  5.95)
  pl: 1637.95 (  0.75)
 pf0: 8304.20 ( 43.94)
 pf1: 8222.09 ( 28.20)

async
  t0: 3885.84 ( 40.20)  0.86% faster (should be error, but stable on my env..)
  pl: 1617.20 (  3.51)  1.26% faster (ditto)
 pf0: 6680.95 (478.72)  19.5% faster
 pf1: 1886.87 ( 36.25)  77.1% faster

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
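[ The 0005 hookup presumably follows the standard resowner remember/forget pattern, something like the sketch below. ResourceOwnerForgetWES later shows up in a backtrace, so that naming matches the patch; the enlarge call and the wrapper function are assumptions. ]

/* Create the async wait event set and let the transaction's resource
 * owner release it (and thus its epoll fd) if an error aborts the
 * query before FreeWaitEventSet runs normally. */
static WaitEventSet *
ExecCreateAsyncEventSet(EState *estate, int nevents)
{
    WaitEventSet *set;

    ResourceOwnerEnlargeWES(TopTransactionResourceOwner);
    set = CreateWaitEventSet(estate->es_query_cxt, nevents);
    set->resowner = TopTransactionResourceOwner;    /* field added by 0005 */
    ResourceOwnerRememberWES(set->resowner, set);
    return set;
}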
Hi, this is the 7th patch, to make instrumentation work.

EXPLAIN ANALYZE shows the following result with the previous patch set:

| Aggregate (cost=820.25..820.26 rows=1 width=8) (actual time=4324.676..4324.676 rows=1 loops=1)
|   -> Append (cost=0.00..791.00 rows=11701 width=4) (actual time=0.910..3663.882 rows=4000000 loops=1)
|     -> Foreign Scan on ft10 (cost=100.00..197.75 rows=2925 width=4) (never executed)
|     -> Foreign Scan on ft20 (cost=100.00..197.75 rows=2925 width=4) (never executed)
|     -> Foreign Scan on ft30 (cost=100.00..197.75 rows=2925 width=4) (never executed)
|     -> Foreign Scan on ft40 (cost=100.00..197.75 rows=2925 width=4) (never executed)
|     -> Seq Scan on pf0 (cost=0.00..0.00 rows=1 width=4) (actual time=0.004..0.004 rows=0 loops=1)

The current instrumentation stuff assumes that a request to a node always returns a tuple or reports the end of tuples. This async framework has two points of executing the underlying nodes, ExecAsyncRequest and ExecAsyncEventLoop, so I'm not sure if this is appropriate, but anyway it seems to show sane numbers:

| Aggregate (cost=820.25..820.26 rows=1 width=8) (actual time=4571.205..4571.206 rows=1 loops=1)
|   -> Append (cost=0.00..791.00 rows=11701 width=4) (actual time=1.362..3893.114 rows=4000000 loops=1)
|     -> Foreign Scan on ft10 (cost=100.00..197.75 rows=2925 width=4) (actual time=1.056..770.863 rows=1000000 loops=1)
|     -> Foreign Scan on ft20 (cost=100.00..197.75 rows=2925 width=4) (actual time=0.461..767.840 rows=1000000 loops=1)
|     -> Foreign Scan on ft30 (cost=100.00..197.75 rows=2925 width=4) (actual time=0.474..782.547 rows=1000000 loops=1)
|     -> Foreign Scan on ft40 (cost=100.00..197.75 rows=2925 width=4) (actual time=0.156..765.920 rows=1000000 loops=1)
|     -> Seq Scan on pf0 (cost=0.00..0.00 rows=1 width=4) (never executed)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
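[ One plausible arrangement given the two execution points: start the node's timer when the request is issued and stop it when the response is delivered, counting tuples on the response side. A sketch only; InstrStartNode/InstrStopNode are the stock instrument.c calls, but the helper functions are hypothetical and the 0007 patch's actual treatment may differ. ]

static void
ExecAsyncInstrumentRequest(PendingAsyncRequest *areq)
{
    /* Clock starts when the request goes out... */
    if (areq->requestee->instrument)
        InstrStartNode(areq->requestee->instrument);
}

static void
ExecAsyncInstrumentResponse(PendingAsyncRequest *areq,
                            TupleTableSlot *slot)
{
    /* ...and stops when the result comes back, whichever of
     * ExecAsyncRequest or ExecAsyncEventLoop delivered it. */
    if (areq->requestee->instrument)
        InstrStopNode(areq->requestee->instrument,
                      TupIsNull(slot) ? 0.0 : 1.0);
}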
Hello, I'm not sure this is in a suitable shape for the commit fest, but I decided to register it to ride on the bus for 10.0.

> Hi, this is the 7th patch to make instrumentation work.

This is a PoC patch of the asynchronous execution feature, based on the executor infrastructure Robert proposed. These patches are rebased on the current master.

0001-robert-s-2nd-framework.patch

  Robert's executor async infrastructure. Async-driver nodes register their async-capable children, and sync and data transfer are done out of band of the ordinary ExecProcNode channel. So async execution no longer disturbs async-unaware nodes or slows them down.

0002-Fix-some-bugs.patch

  Some fixes for 0001 to work. This is kept separate just to preserve the shape of the 0001 patch.

0003-Modify-async-execution-infrastructure.patch

  The original infrastructure doesn't work when multiple foreign tables are on the same connection. This makes it work.

0004-Make-postgres_fdw-async-capable.patch

  Makes postgres_fdw work asynchronously.

0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch

  This addresses a problem pointed out by Robert about the 0001 patch, that the WaitEventSet used for async execution can leak on errors.

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch

  ExecAppend gets a bit slower from branch-misprediction penalties. This fixes it by using the unlikely() macro.

0007-Add-instrumentation-to-async-execution.patch

  As described above for 0001, the async infrastructure conveys tuples outside the ExecProcNode channel, so EXPLAIN ANALYZE requires special treatment to show sane results. This patch tries that.

A result of a performance measurement is in this message.

https://www.postgresql.org/message-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp

| t0 - SELECT sum(a) FROM <local single table>;
| pl - SELECT sum(a) FROM <4 local children>;
| pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
| pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
...
| async
| t0: 3885.84 ( 40.20) 0.86% faster (should be error but stable on my env..)
| pl: 1617.20 ( 3.51) 1.26% faster (ditto)
| pf0: 6680.95 (478.72) 19.5% faster
| pf1: 1886.87 ( 36.25) 77.1% faster

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hello, this is a maintenance post of rebased patches. I added a change of ResourceOwnerData missed in 0005. At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp> > This a PoC patch of asynchronous execution feature, based on a > executor infrastructure Robert proposed. These patches are > rebased on the current master. > > 0001-robert-s-2nd-framework.patch > > Roberts executor async infrastructure. Async-driver nodes > register its async-capable children and sync and data transfer > are done out of band of ordinary ExecProcNode channel. So async > execution no longer disturbs async-unaware node and slows them > down. > > 0002-Fix-some-bugs.patch > > Some fixes for 0001 to work. This is just to preserve the shape > of 0001 patch. > > 0003-Modify-async-execution-infrastructure.patch > > The original infrastructure doesn't work when multiple foreign > tables is on the same connection. This makes it work. > > 0004-Make-postgres_fdw-async-capable.patch > > Makes postgres_fdw to work asynchronously. > > 0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch > > This addresses a problem pointed by Robers about 0001 patch, > that WaitEventSet used for async execution can leak by errors. > > 0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch > > ExecAppend gets a bit slower by penalties of misprediction of > branches. This fixes it by using unlikely() macro. > > 0007-Add-instrumentation-to-async-execution.patch > > As the description above for 0001, async infrastructure conveys > tuples outside ExecProcNode channel so EXPLAIN ANALYZE requires > special treat to show sane results. This patch tries that. > > > A result of a performance measurement is in this message. > > https://www.postgresql.org/message-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp > > > | t0 - SELECT sum(a) FROM <local single table>; > | pl - SELECT sum(a) FROM <4 local children>; > | pf0 - SELECT sum(a) FROM <4 foreign children on single connection>; > | pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>; > ... > | async > | t0: 3885.84 ( 40.20) 0.86% faster (should be error but stable on my env..) > | pl: 1617.20 ( 3.51) 1.26% faster (ditto) > | pf0: 6680.95 (478.72) 19.5% faster > | pf1: 1886.87 ( 36.25) 77.1% faster regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, I cannot respond until next Monday, so I move this to the next CF by myself. At Tue, 15 Nov 2016 20:25:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161115.202513.268072050.horiguchi.kyotaro@lab.ntt.co.jp> > Hello, this is a maintenance post of reased patches. > I added a change of ResourceOwnerData missed in 0005. > > At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp> > > This a PoC patch of asynchronous execution feature, based on a > > executor infrastructure Robert proposed. These patches are > > rebased on the current master. > > > > 0001-robert-s-2nd-framework.patch > > > > Roberts executor async infrastructure. Async-driver nodes > > register its async-capable children and sync and data transfer > > are done out of band of ordinary ExecProcNode channel. So async > > execution no longer disturbs async-unaware node and slows them > > down. > > > > 0002-Fix-some-bugs.patch > > > > Some fixes for 0001 to work. This is just to preserve the shape > > of 0001 patch. > > > > 0003-Modify-async-execution-infrastructure.patch > > > > The original infrastructure doesn't work when multiple foreign > > tables is on the same connection. This makes it work. > > > > 0004-Make-postgres_fdw-async-capable.patch > > > > Makes postgres_fdw to work asynchronously. > > > > 0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch > > > > This addresses a problem pointed by Robers about 0001 patch, > > that WaitEventSet used for async execution can leak by errors. > > > > 0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch > > > > ExecAppend gets a bit slower by penalties of misprediction of > > branches. This fixes it by using unlikely() macro. > > > > 0007-Add-instrumentation-to-async-execution.patch > > > > As the description above for 0001, async infrastructure conveys > > tuples outside ExecProcNode channel so EXPLAIN ANALYZE requires > > special treat to show sane results. This patch tries that. > > > > > > A result of a performance measurement is in this message. > > > > https://www.postgresql.org/message-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp > > > > > > | t0 - SELECT sum(a) FROM <local single table>; > > | pl - SELECT sum(a) FROM <4 local children>; > > | pf0 - SELECT sum(a) FROM <4 foreign children on single connection>; > > | pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>; > > ... > > | async > > | t0: 3885.84 ( 40.20) 0.86% faster (should be error but stable on my env..) > > | pl: 1617.20 ( 3.51) 1.26% faster (ditto) > > | pf0: 6680.95 (478.72) 19.5% faster > > | pf1: 1886.87 ( 36.25) 77.1% faster -- Kyotaro Horiguchi NTT Open Source Software Center
This patch conflicts with e13029a (es_query_dsa) so I rebased this. At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp> > This a PoC patch of asynchronous execution feature, based on a > executor infrastructure Robert proposed. These patches are > rebased on the current master. > > 0001-robert-s-2nd-framework.patch > > Roberts executor async infrastructure. Async-driver nodes > register its async-capable children and sync and data transfer > are done out of band of ordinary ExecProcNode channel. So async > execution no longer disturbs async-unaware node and slows them > down. > > 0002-Fix-some-bugs.patch > > Some fixes for 0001 to work. This is just to preserve the shape > of 0001 patch. > > 0003-Modify-async-execution-infrastructure.patch > > The original infrastructure doesn't work when multiple foreign > tables is on the same connection. This makes it work. > > 0004-Make-postgres_fdw-async-capable.patch > > Makes postgres_fdw to work asynchronously. > > 0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch > > This addresses a problem pointed by Robers about 0001 patch, > that WaitEventSet used for async execution can leak by errors. > > 0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch > > ExecAppend gets a bit slower by penalties of misprediction of > branches. This fixes it by using unlikely() macro. > > 0007-Add-instrumentation-to-async-execution.patch > > As the description above for 0001, async infrastructure conveys > tuples outside ExecProcNode channel so EXPLAIN ANALYZE requires > special treat to show sane results. This patch tries that. > > > A result of a performance measurement is in this message. > > https://www.postgresql.org/message-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp > > > | t0 - SELECT sum(a) FROM <local single table>; > | pl - SELECT sum(a) FROM <4 local children>; > | pf0 - SELECT sum(a) FROM <4 foreign children on single connection>; > | pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>; > ... > | async > | t0: 3885.84 ( 40.20) 0.86% faster (should be error but stable on my env..) > | pl: 1617.20 ( 3.51) 1.26% faster (ditto) > | pf0: 6680.95 (478.72) 19.5% faster > | pf1: 1886.87 ( 36.25) 77.1% faster -- Kyotaro Horiguchi NTT Open Source Software Center
I noticed that this patch is conflicting with 665d1fa (Logical replication) so I rebased this. Only executor/Makefile conflicted. At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp> > This a PoC patch of asynchronous execution feature, based on a > executor infrastructure Robert proposed. These patches are > rebased on the current master. > > 0001-robert-s-2nd-framework.patch > > Roberts executor async infrastructure. Async-driver nodes > register its async-capable children and sync and data transfer > are done out of band of ordinary ExecProcNode channel. So async > execution no longer disturbs async-unaware node and slows them > down. > > 0002-Fix-some-bugs.patch > > Some fixes for 0001 to work. This is just to preserve the shape > of 0001 patch. > > 0003-Modify-async-execution-infrastructure.patch > > The original infrastructure doesn't work when multiple foreign > tables is on the same connection. This makes it work. > > 0004-Make-postgres_fdw-async-capable.patch > > Makes postgres_fdw to work asynchronously. > > 0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch > > This addresses a problem pointed by Robers about 0001 patch, > that WaitEventSet used for async execution can leak by errors. > > 0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch > > ExecAppend gets a bit slower by penalties of misprediction of > branches. This fixes it by using unlikely() macro. > > 0007-Add-instrumentation-to-async-execution.patch > > As the description above for 0001, async infrastructure conveys > tuples outside ExecProcNode channel so EXPLAIN ANALYZE requires > special treat to show sane results. This patch tries that. > > > A result of a performance measurement is in this message. > > https://www.postgresql.org/message-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp > > > | t0 - SELECT sum(a) FROM <local single table>; > | pl - SELECT sum(a) FROM <4 local children>; > | pf0 - SELECT sum(a) FROM <4 foreign children on single connection>; > | pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>; > ... > | async > | t0: 3885.84 ( 40.20) 0.86% faster (should be error but stable on my env..) > | pl: 1617.20 ( 3.51) 1.26% faster (ditto) > | pf0: 6680.95 (478.72) 19.5% faster > | pf1: 1886.87 ( 36.25) 77.1% faster regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Jan 31, 2017 at 12:45 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > I noticed that this patch is conflicting with 665d1fa (Logical > replication) so I rebased this. Only executor/Makefile > conflicted. The patches still apply, moved to CF 2017-03. Be aware of that: $ git diff HEAD~6 --check contrib/postgres_fdw/postgres_fdw.c:388: indent with spaces. + PendingAsyncRequest *areq, contrib/postgres_fdw/postgres_fdw.c:389: indent with spaces. + bool reinit); src/backend/utils/resowner/resowner.c:1332: new blank line at EOF. -- Michael
Thank you. At Wed, 1 Feb 2017 14:11:58 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqS0MhZrzgMVQeFEnnKABcsMnNULd8=O0PG7_h-FUp5aEQ@mail.gmail.com> > On Tue, Jan 31, 2017 at 12:45 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > I noticed that this patch is conflicting with 665d1fa (Logical > > replication) so I rebased this. Only executor/Makefile > > conflicted. > > The patches still apply, moved to CF 2017-03. Be aware of that: > $ git diff HEAD~6 --check > contrib/postgres_fdw/postgres_fdw.c:388: indent with spaces. > + PendingAsyncRequest *areq, > contrib/postgres_fdw/postgres_fdw.c:389: indent with spaces. > + bool reinit); > src/backend/utils/resowner/resowner.c:1332: new blank line at EOF. Thank you for letting me know the command. I changed my check scripts to use them and it seems working fine on both commit and rebase. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> I noticed that this patch is conflicting with 665d1fa (Logical
> replication) so I rebased this. Only executor/Makefile
> conflicted.

I was lucky enough to see an infinite loop when using this patch, which I fixed by this change:

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 588ba18..9b87fbd
*** a/src/backend/executor/execAsync.c
--- b/src/backend/executor/execAsync.c
*************** ExecAsyncEventWait(EState *estate, long
*** 364,369 ****
--- 364,370 ----
      if ((w->events & WL_LATCH_SET) != 0)
      {
+         ResetLatch(MyLatch);
          process_latch_set = true;
          continue;
      }

Actually _almost_ fixed, because at some point one of the following

    Assert(areq->state == ASYNC_WAITING);

statements fired. I think it was the immediately following one, but I can imagine the same happening in the branch

    if (process_latch_set)
    ...

I think the wants_process_latch field of PendingAsyncRequest is not useful alone, because the process latch can be set for reasons completely unrelated to the asynchronous processing. If the asynchronous node should use the latch to signal its readiness, I think an additional flag is needed in the request which tells ExecAsyncEventWait that the latch was set by the asynchronous node.

BTW, do we really need the ASYNC_CALLBACK_PENDING state? I can imagine the async node either changing ASYNC_WAITING directly to ASYNC_COMPLETE, or leaving it ASYNC_WAITING if the data is not ready.

In addition, the following comments are based only on code review; I didn't verify my understanding experimentally:

* Isn't it possible for AppendState.as_asyncresult to contain multiple responses from the same async node? Since the array stores TupleTableSlot instead of the actual tuple (so multiple items of as_asyncresult point to the same slot), I suspect the slot contents might not be defined when the Append node eventually tries to return it to the upper plan.

* For the WaitEvent subsystem to work, I think postgres_fdw should keep a separate libpq connection per node, not per user mapping. Currently the connections are cached by user mapping, but it's legal to locate multiple child postgres_fdw nodes of an Append plan on the same remote server. I expect that these "co-located" nodes would currently use the same user mapping and therefore the same connection.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
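[ The extra flag Antonin suggests could make the quoted hunk look roughly like the fragment below. latch_set and ExecAsyncNotify are hypothetical names, and es_pending_async is assumed to be the EState's array of pending requests. ]

if ((w->events & WL_LATCH_SET) != 0)
{
    int     i;

    ResetLatch(MyLatch);

    /* Only treat requests whose node explicitly flagged the latch as
     * ready; an unrelated SetLatch leaves every latch_set false and
     * we simply go around the loop again. */
    for (i = 0; i < estate->es_num_pending_async; ++i)
    {
        PendingAsyncRequest *areq = estate->es_pending_async[i];

        if (areq->state == ASYNC_WAITING && areq->latch_set)
        {
            areq->latch_set = false;
            ExecAsyncNotify(estate, areq);
        }
    }
    continue;
}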
On Fri, Feb 3, 2017 at 5:04 AM, Antonin Houska <ah@cybertec.at> wrote:
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> I noticed that this patch is conflicting with 665d1fa (Logical
>> replication) so I rebased this. Only executor/Makefile
>> conflicted.
>
> I was lucky enough to see an infinite loop when using this patch, which I
> fixed by this change:
>
> diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
> new file mode 100644
> index 588ba18..9b87fbd
> *** a/src/backend/executor/execAsync.c
> --- b/src/backend/executor/execAsync.c
> *************** ExecAsyncEventWait(EState *estate, long
> *** 364,369 ****
> --- 364,370 ----
>       if ((w->events & WL_LATCH_SET) != 0)
>       {
> +         ResetLatch(MyLatch);
>           process_latch_set = true;
>           continue;
>       }
Hi, I've been testing this patch because it seemed like it would help a use case of mine, but I can't tell if it's currently working for cases other than a local parent table that has many child partitions which happen to be foreign tables. Is it? I was hoping to use it for a case like:
select x, sum(y) from one_remote_table
union all
select x, sum(y) from another_remote_table
union all
select x, sum(y) from a_third_remote_table
but while aggregates do appear to be pushed down, it seems that the remote tables are being queried in sequence. Am I doing something wrong?
Horiguchi-san,

On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:
> I noticed that this patch is conflicting with 665d1fa (Logical
> replication) so I rebased this. Only executor/Makefile
> conflicted.

With the latest set of patches, I observe a crash due to an Assert failure:

#0 0x0000003969632625 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003969633e05 in *__GI_abort () at abort.c:92
#2 0x000000000098b22c in ExceptionalCondition (conditionName=0xb30e02 "!(added)", errorType=0xb30d77 "FailedAssertion", fileName=0xb30d50 "execAsync.c", lineNumber=345) at assert.c:54
#3 0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8, timeout=-1) at execAsync.c:345
#4 0x0000000000687ed5 in ExecAsyncEventLoop (estate=0x13c01b8, requestor=0x13c1640, timeout=-1) at execAsync.c:186
#5 0x00000000006a5170 in ExecAppend (node=0x13c1640) at nodeAppend.c:257
#6 0x0000000000692b9b in ExecProcNode (node=0x13c1640) at execProcnode.c:411
#7 0x00000000006bf4d7 in ExecResult (node=0x13c1170) at nodeResult.c:113
#8 0x0000000000692b5c in ExecProcNode (node=0x13c1170) at execProcnode.c:399
#9 0x00000000006a596b in fetch_input_tuple (aggstate=0x13c06a0) at nodeAgg.c:587
#10 0x00000000006a8530 in agg_fill_hash_table (aggstate=0x13c06a0) at nodeAgg.c:2272
#11 0x00000000006a7e76 in ExecAgg (node=0x13c06a0) at nodeAgg.c:1910
#12 0x0000000000692d69 in ExecProcNode (node=0x13c06a0) at execProcnode.c:514
#13 0x00000000006c1a42 in ExecSort (node=0x13c03d0) at nodeSort.c:103
#14 0x0000000000692d3f in ExecProcNode (node=0x13c03d0) at execProcnode.c:506
#15 0x000000000068e733 in ExecutePlan (estate=0x13c01b8, planstate=0x13c03d0, use_parallel_mode=0 '\000', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0, direction=ForwardScanDirection, dest=0x7fa368ee1da8) at execMain.c:1609
#16 0x000000000068c751 in standard_ExecutorRun (queryDesc=0x135c568, direction=ForwardScanDirection, count=0) at execMain.c:341
#17 0x000000000068c5dc in ExecutorRun (queryDesc=0x135c568,
<snip>

I was running a query whose plan looked like:

explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab group by 1,2 order by 1;
                      QUERY PLAN
------------------------------------------------------
 Sort
   Sort Key: ((ptab.tableoid)::regclass)
   ->  HashAggregate
         Group Key: (ptab.tableoid)::regclass, ptab.a
         ->  Result
               ->  Append
                     ->  Foreign Scan on ptab_00001
                     ->  Foreign Scan on ptab_00002
                     ->  Foreign Scan on ptab_00003
                     ->  Foreign Scan on ptab_00004
                     ->  Foreign Scan on ptab_00005
                     ->  Foreign Scan on ptab_00006
                     ->  Foreign Scan on ptab_00007
                     ->  Foreign Scan on ptab_00008
                     ->  Foreign Scan on ptab_00009
                     ->  Foreign Scan on ptab_00010
<snip>

The snipped part contains Foreign Scans on 90 more foreign partitions (in fact, I could see the crash even with 10 foreign table partitions for the same query).

There is a crash in one more case, which seems related to how WaitEventSet objects are manipulated during resource-owner-mediated cleanup of a failed query, such as after the FDW returned an error like below:

ERROR: relation "public.ptab_00010" does not exist
CONTEXT: Remote SQL command: SELECT a, b FROM public.ptab_00010

The backtrace in this case looks like below:

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf, value=20645152) at resowner.c:301
301 lastidx = resarr->lastidx;
(gdb)
(gdb) bt
#0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf, value=20645152) at resowner.c:301
#1 0x00000000009c6578 in ResourceOwnerForgetWES (owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
#2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
#3 0x00000000009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768, phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001') at resowner.c:566
#4 0x00000000009c5155 in ResourceOwnerRelease (owner=0x12de768, phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001') at resowner.c:485
#5 0x0000000000524172 in AbortTransaction () at xact.c:2588
#6 0x0000000000524854 in AbortCurrentTransaction () at xact.c:3016
#7 0x0000000000836aa6 in PostgresMain (argc=1, argv=0x12d7b08, dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
#8 0x00000000007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
#9 0x00000000007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
#10 0x00000000007a0885 in ServerLoop () at postmaster.c:1722
#11 0x000000000079febf in PostmasterMain (argc=3, argv=0x12aacc0) at postmaster.c:1330
#12 0x00000000006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228

There is a segfault when accessing the events variable, whose members seem to be pfreed:

(gdb) f 2
#2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
600 ResourceOwnerForgetWES(set->resowner, set);
(gdb) p *set
$5 = {
  nevents = 2139062143,
  nevents_space = 2139062143,
  resowner = 0x7f7f7f7f7f7f7f7f,
  events = 0x7f7f7f7f7f7f7f7f,
  latch = 0x7f7f7f7f7f7f7f7f,
  latch_pos = 2139062143,
  epoll_fd = 2139062143,
  epoll_ret_events = 0x7f7f7f7f7f7f7f7f
}

Thanks,
Amit
Thank you very much for testing this! At Tue, 7 Feb 2017 13:28:42 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <9058d70b-a6b0-8b3c-091a-fe77ed0df580@lab.ntt.co.jp> > Horiguchi-san, > > On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote: > > I noticed that this patch is conflicting with 665d1fa (Logical > > replication) so I rebased this. Only executor/Makefile > > conflicted. > > With the latest set of patches, I observe a crash due to an Assert failure: > > #3 0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8, > timeout=-1) at execAsync.c:345 This means that no pending FDW scan let itself go to the waiting stage, which makes the whole thing get stuck. This can happen if nothing is actually waiting for a result. I suppose that all of the foreign scans ran on the same connection. Anyway, it should be a mistake in the state transitions. I'll look into it. > I was running a query whose plan looked like: > > explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab > group by 1,2 order by 1; > QUERY PLAN > ------------------------------------------------------ > Sort > Sort Key: ((ptab.tableoid)::regclass) > -> HashAggregate > Group Key: (ptab.tableoid)::regclass, ptab.a > -> Result > -> Append > -> Foreign Scan on ptab_00001 > -> Foreign Scan on ptab_00002 > -> Foreign Scan on ptab_00003 > -> Foreign Scan on ptab_00004 > -> Foreign Scan on ptab_00005 > -> Foreign Scan on ptab_00006 > -> Foreign Scan on ptab_00007 > -> Foreign Scan on ptab_00008 > -> Foreign Scan on ptab_00009 > -> Foreign Scan on ptab_00010 > <snip> > > The snipped part contains Foreign Scans on 90 more foreign partitions (in > fact, I could see the crash even with 10 foreign table partitions for the > same query). Yeah, it seems to me unrelated to how many there are. > There is a crash in one more case, which seems related to how WaitEventSet > objects are manipulated during resource-owner-mediated cleanup of a failed > query, such as after the FDW returned an error like below: > > ERROR: relation "public.ptab_00010" does not exist > CONTEXT: Remote SQL command: SELECT a, b FROM public.ptab_00010 > > The backtrace in this looks like below: > > Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf, > value=20645152) at resowner.c:301 > 301 lastidx = resarr->lastidx; > (gdb) > (gdb) bt > #0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf, > value=20645152) at resowner.c:301 > #1 0x00000000009c6578 in ResourceOwnerForgetWES > (owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317 > #2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600 > #3 0x00000000009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768, > phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001') > at resowner.c:566 > #4 0x00000000009c5155 in ResourceOwnerRelease (owner=0x12de768, > phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 > '\001') at resowner.c:485 > #5 0x0000000000524172 in AbortTransaction () at xact.c:2588 > #6 0x0000000000524854 in AbortCurrentTransaction () at xact.c:3016 > #7 0x0000000000836aa6 in PostgresMain (argc=1, argv=0x12d7b08, > dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860 > #8 0x00000000007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310 > #9 0x00000000007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982 > #10 0x00000000007a0885 in ServerLoop () at postmaster.c:1722 > #11 0x000000000079febf in PostmasterMain (argc=3, argv=0x12aacc0) at > postmaster.c:1330 > #12 0x00000000006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228 > > There is a segfault when accessing the events variable, whose members seem > to be pfreed: > > (gdb) f 2 > #2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600 > 600 ResourceOwnerForgetWES(set->resowner, set); > (gdb) p *set > $5 = { > nevents = 2139062143, > nevents_space = 2139062143, > resowner = 0x7f7f7f7f7f7f7f7f, > events = 0x7f7f7f7f7f7f7f7f, > latch = 0x7f7f7f7f7f7f7f7f, > latch_pos = 2139062143, > epoll_fd = 2139062143, > epoll_ret_events = 0x7f7f7f7f7f7f7f7f > } Mmm, I reproduced it quite easily. A silly bug. Something bad is happening between the freeing of the ExecutorState memory context and the resource owner. Perhaps the ExecutorState is freed by the resowner (as a part of its ancestors) before the memory for the WaitEventSet is freed. It was careless of me. I'll reconsider it. Great thanks for the report. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
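[ For context, the failed assertion guards the contract that every pass through ExecAsyncEventWait registers at least one event before sleeping; otherwise WaitEventSetWait would block forever. A reconstruction of the relevant stretch; es_wait_event_set, es_num_pending_async, and WAIT_EVENT_ASYNC_WAIT are names from the thread, while the loop details are guesses. ]

added = false;
for (i = 0; i < estate->es_num_pending_async; ++i)
{
    PendingAsyncRequest *areq = estate->es_pending_async[i];

    if (areq->state == ASYNC_WAITING &&
        ExecAsyncConfigureWait(estate, areq, reinit))
        added = true;
}
Assert(added);          /* execAsync.c:345 in the reported build */

noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
                             occurred_events, nevents,
                             WAIT_EVENT_ASYNC_WAIT);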
At Thu, 16 Feb 2017 21:06:00 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170216.210600.214980879.horiguchi.kyotaro@lab.ntt.co.jp> > > #3 0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8, > > timeout=-1) at execAsync.c:345 > > This means no pending fdw scan didn't let itself go to waiting > stage. It leads to a stuck of the whole things. This is caused if > no one acutually is waiting for result. I suppose that all of the > foreign scans ran on the same connection. Anyway it should be a > mistake in state transition. I'll look into it. ... > > I was running a query whose plan looked like: > > > > explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab > > group by 1,2 order by 1; > > QUERY PLAN > > ------------------------------------------------------ > > Sort > > Sort Key: ((ptab.tableoid)::regclass) > > -> HashAggregate > > Group Key: (ptab.tableoid)::regclass, ptab.a > > -> Result > > -> Append > > -> Foreign Scan on ptab_00001 > > -> Foreign Scan on ptab_00002 > > -> Foreign Scan on ptab_00003 > > -> Foreign Scan on ptab_00004 > > -> Foreign Scan on ptab_00005 > > -> Foreign Scan on ptab_00006 > > -> Foreign Scan on ptab_00007 > > -> Foreign Scan on ptab_00008 > > -> Foreign Scan on ptab_00009 > > -> Foreign Scan on ptab_00010 > > <snip> > > > > The snipped part contains Foreign Scans on 90 more foreign partitions (in > > fact, I could see the crash even with 10 foreign table partitions for the > > same query). > > Yeah, it seems to me unrelated to how many they are. In the end, I couldn't reproduce the crash for the (maybe) same case. I can guess two reasons for this. One is a situation where node->as_nasyncpending differs from estate->es_num_pending_async, but I couldn't find how that could happen. Another is a situation in postgresIterateForeignScan where the "next owner" reaches EOF but another waiter has not. I haven't reproduced the situation, but I fixed it for that case. In addition to that, I found a bug in ExecAsyncAppendResponse: it calls bms_add_member in an inappropriate way. > Mmm, I reproduces it quite easily. A silly bug. > > Something bad is happening between freeing ExecutorState memory > context and resource owner. Perhaps the ExecutorState is freed by > resowner (as a part of its anscestors) before the memory for the > WaitEventSet is freed. It was careless of me. I'll reconsider it. The cause was that the WaitEventSet was placed in ExecutorState but registered to TopTransactionResourceOwner. I fixed it. These fixes are made on top of the previous patches for now. In the attached files, 0008 and 0009 are for the second bug, 0012 is for the first bug, and 0013 is for the bms bug. Sorry for the confusing patches; I will resend neater ones soon. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, I totally reorganized the patch set into four patches on the current master (9e43e87).

At Wed, 22 Feb 2017 17:39:45 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170222.173945.262776579.horiguchi.kyotaro@lab.ntt.co.jp>
> Finally, I couldn't see the crash for the (maybe) same case. I
> can guess two reasons for this. One is a situation where
> node->as_nasyncpending differs from estate->es_num_pending_async,
> but I couldn't find a possibility of that. Another is a situation in
> postgresIterateForeignScan where the "next owner" reaches EOF but
> another waiter has not. I haven't reproduced the situation but
> fixed the code for that case. In addition, I found a bug in
> ExecAsyncAppendResponse: it calls bms_add_member in an inappropriate
> way.

This was found to be wrong. The true problem here was (maybe) that ExecAsyncRequest can complete a tuple immediately. This causes multiple calls to ExecAsyncRequest for the same child at once. (In that case, the processing node is added again to node->as_needrequest before ExecAsyncRequest returns.) Using a copy of node->as_needrequest would fix this, but that feels uneasy, so I changed ExecAsyncRequest not to return a tuple immediately. Instead, ExecAsyncEventLoop skips waiting if there is no node to wait for. The tuple previously "response"'d in ExecAsyncRequest is now responded there.

In addition to that, the current policy of preserving es_wait_event_set doesn't seem to work with the async-capable postgres_fdw, so the current code clears it on every entry to ExecAppend. This needs more thought.

I measured the performance of async execution and it was much improved from the previous version, especially for the single-connection environment.

pf0: 4 foreign tables on a single connection
  non async : (prev) 7928ms -> (this time) 7993ms
  async     : (prev) 6447ms -> (this time) 3211ms

pf1: 4 foreign tables, a dedicated connection for every table
  non async : (prev) 8023ms -> (this time) 7953ms
  async     : (prev) 1876ms -> (this time) 1841ms

The boost rate from async execution is 60% for the single-connection and 77% for the dedicated-connection environment.

> > Mmm, I reproduced it quite easily. A silly bug.
> >
> > Something bad is happening between freeing the ExecutorState memory
> > context and the resource owner. Perhaps the ExecutorState is freed by
> > the resowner (as a part of its ancestors) before the memory for the
> > WaitEventSet is freed. It was careless of me. I'll reconsider it.
>
> The cause was that the WaitEventSet was placed in ExecutorState
> but registered to TopTransactionResourceOwner. I fixed it.

The attached patches are the following.

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch
  Allows a WaitEventSet to be released by a resource owner.

0002-Asynchronous-execution-framework.patch
  Asynchronous execution framework based on Robert's version. All edits on this are merged.

0003-Make-postgres_fdw-async-capable.patch
  Makes postgres_fdw async-capable.

0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch
  This can be merged into 0002, but I didn't, since the use of these pragmas is arguable.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
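As a minimal sketch of what the 0001 fix implies, assuming the patched CreateWaitEventSet() signature that takes a ResourceOwner (the helper name below is hypothetical): the point is to pair the set's memory context with an owner of matching lifetime.

	/*
	 * Allocate the WaitEventSet in per-query memory and register it with
	 * the current resource owner, whose release coincides with that
	 * memory going away - instead of TopTransactionResourceOwner, which
	 * outlives the ExecutorState.
	 */
	static WaitEventSet *
	create_query_wait_event_set(EState *estate, int nevents)
	{
		return CreateWaitEventSet(estate->es_query_cxt,
								  CurrentResourceOwner,
								  nevents);
	}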
Patch fails on current master, but correctly applies to 9e43e87. Thanks for including the commit id.
Regression tests pass.
As with my last attempt at reviewing this patch, I'm confused about what kind of queries can take advantage of this patch. Is it only cases where a local table has multiple inherited foreign table children? Will it work with queries where two foreign tables are referenced and combined with a UNION ALL?
On 2017/03/11 8:19, Corey Huinker wrote:
>
> On Thu, Feb 23, 2017 at 6:59 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp <mailto:horiguchi.kyotaro@lab.ntt.co.jp>>
> wrote:
>
>     9e43e87
>
> Patch fails on current master, but correctly applies to 9e43e87. Thanks
> for including the commit id.
>
> Regression tests pass.
>
> As with my last attempt at reviewing this patch, I'm confused about what
> kind of queries can take advantage of this patch. Is it only cases where a
> local table has multiple inherited foreign table children?

IIUC, Horiguchi-san's patch adds asynchronous capability for ForeignScans driven by postgres_fdw (after building some relevant infrastructure first). The same might be added to other Scan nodes (and probably other nodes as well) eventually, so that more queries will benefit from asynchronous execution. It may just be that ForeignScans benefit more from asynchronous execution than other Scan types.

> Will it work
> with queries where two foreign tables are referenced and combined with a
> UNION ALL?

I think it will, because Append itself has been made async-capable by one of the patches and UNION ALL uses Append. But as mentioned above, only the postgres_fdw foreign tables will be able to utilize this for now.

Thanks,
Amit
> I think it will, because Append itself has been made async-capable by one
> of the patches and UNION ALL uses Append. But as mentioned above, only
> the postgres_fdw foreign tables will be able to utilize this for now.
Ok, I'll re-run my test from a few weeks back and see if anything has changed.
On Mon, Mar 13, 2017 at 1:06 AM, Corey Huinker <corey.huinker@gmail.com> wrote:
>> I think it will, because Append itself has been made async-capable by one
>> of the patches and UNION ALL uses Append. But as mentioned above, only
>> the postgres_fdw foreign tables will be able to utilize this for now.
>
> Ok, I'll re-run my test from a few weeks back and see if anything has changed.
I'm not able to discern any difference in plan between a 9.6 instance and this patch.
The basic outline of my test is:
EXPLAIN ANALYZE
SELECT c1, c2, ..., cN FROM tab1 WHERE date = '1 day ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab2 WHERE date = '2 days ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab3 WHERE date = '3 days ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab4 WHERE date = '4 days ago'
I've tried this test where tab1 through tab4 are all the same postgres_fdw foreign table.
I've tried this test where tab1 through tab4 are all different foreign tables pointing to the same remote table, sharing the same server definition.
I've tried this test where tab1 through tab4 are all different foreign tables, each with its own foreign server definition, all of which happen to point to the same remote cluster.
Are there some postgresql.conf settings I should set to get a decent test?
On 2017/03/14 6:31, Corey Huinker wrote:
> On Mon, Mar 13, 2017 at 1:06 AM, Corey Huinker <corey.huinker@gmail.com>
> wrote:
>
>>> I think it will, because Append itself has been made async-capable by one
>>> of the patches and UNION ALL uses Append. But as mentioned above, only
>>> the postgres_fdw foreign tables will be able to utilize this for now.
>>
>> Ok, I'll re-run my test from a few weeks back and see if anything has
>> changed.
>
> I'm not able to discern any difference in plan between a 9.6 instance and
> this patch.
>
> The basic outline of my test is:
>
> EXPLAIN ANALYZE
> SELECT c1, c2, ..., cN FROM tab1 WHERE date = '1 day ago'
> UNION ALL
> SELECT c1, c2, ..., cN FROM tab2 WHERE date = '2 days ago'
> UNION ALL
> SELECT c1, c2, ..., cN FROM tab3 WHERE date = '3 days ago'
> UNION ALL
> SELECT c1, c2, ..., cN FROM tab4 WHERE date = '4 days ago'
>
> I've tried this test where tab1 through tab4 are all the same postgres_fdw
> foreign table.
> I've tried this test where tab1 through tab4 are all different foreign
> tables pointing to the same remote table, sharing the same server
> definition.
> I've tried this test where tab1 through tab4 are all different foreign
> tables, each with its own foreign server definition, all of which
> happen to point to the same remote cluster.
>
> Are there some postgresql.conf settings I should set to get a decent test?

I don't think the plan itself will change as a result of applying this patch. You might however be able to observe some performance improvement.

Thanks,
Amit
> I don't think the plan itself will change as a result of applying this
> patch. You might however be able to observe some performance improvement.
I could see no performance improvement, even with 16 separate queries combined with UNION ALL. Query performance was always within +/- 10% of a 9.6 instance given the same script. I must be missing something.
On 2017/03/14 10:08, Corey Huinker wrote:
>> I don't think the plan itself will change as a result of applying this
>> patch. You might however be able to observe some performance improvement.
>
> I could see no performance improvement, even with 16 separate queries
> combined with UNION ALL. Query performance was always within +/- 10% of a 9.6
> instance given the same script. I must be missing something.

Hmm, maybe I'm missing something too.

Anyway, here is an older message on this thread from Horiguchi-san where he shared some of the test cases that this patch improves performance for:

https://www.postgresql.org/message-id/20161018.103051.30820907.horiguchi.kyotaro%40lab.ntt.co.jp

From that message:

<quote>
I measured performance and had the following result.

 t0  - SELECT sum(a) FROM <local single table>;
 pl  - SELECT sum(a) FROM <4 local children>;
 pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
 pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;

The result is written as "time<ms> (std dev <ms>)"

sync
 t0: 3820.33 (  1.88)
 pl: 1608.59 ( 12.06)
pf0: 7928.29 ( 46.58)
pf1: 8023.16 ( 26.43)

async
 t0: 3806.31 (  4.49)   0.4% faster (should be error)
 pl: 1629.17 (  0.29)   1.3% slower
pf0: 6447.07 ( 25.19)  18.7% faster
pf1: 1876.80 ( 47.13)  76.6% faster
</quote>

IIUC, pf0 and pf1 are the same test case (all 4 foreign tables target the same server) measured with different implementations of the patch.

Thanks,
Amit
I reworked the test such that all of the foreign tables inherit from the same parent table, and if you query that you do get async execution. But it doesn't work when just stringing together those foreign tables with UNION ALLs.
I don't know how to proceed with this review if that was a goal of the patch.
Corey Huinker <corey.huinker@gmail.com> writes:
> I reworked the test such that all of the foreign tables inherit from the
> same parent table, and if you query that you do get async execution. But it
> doesn't work when just stringing together those foreign tables with UNION
> ALLs.

> I don't know how to proceed with this review if that was a goal of the
> patch.

Whether it was a goal or not, I'd say there is something either broken or incorrectly implemented if you don't see that. The planner (and therefore also the executor) generally treats inheritance the same as simple UNION ALL. If that's not the case here, I'd want to know why.

regards, tom lane
Updated commitfest entry to "Returned With Feedback".
At Thu, 16 Mar 2017 17:16:32 -0400, Corey Huinker <corey.huinker@gmail.com> wrote in <CADkLM=cBZEX9L9HnhJYrtfiAN5Ebdu=xbvM_poWVGBR7yN3gVw@mail.gmail.com>
> On Thu, Mar 16, 2017 at 4:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Corey Huinker <corey.huinker@gmail.com> writes:
> > > I reworked the test such that all of the foreign tables inherit from the
> > > same parent table, and if you query that you do get async execution. But it
> > > doesn't work when just stringing together those foreign tables with UNION
> > > ALLs.
> >
> > > I don't know how to proceed with this review if that was a goal of the
> > > patch.
> >
> > Whether it was a goal or not, I'd say there is something either broken
> > or incorrectly implemented if you don't see that. The planner (and
> > therefore also the executor) generally treats inheritance the same as
> > simple UNION ALL. If that's not the case here, I'd want to know why.
> >
> > regards, tom lane
>
> Updated commitfest entry to "Returned With Feedback".

Sorry for the absence. For information, I'll continue to write some more.

At Tue, 14 Mar 2017 10:28:36 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <e7dc8128-f32b-ff9a-870e-f1117b8e4fa6@lab.ntt.co.jp>
> async
>  t0: 3806.31 (  4.49)   0.4% faster (should be error)
>  pl: 1629.17 (  0.29)   1.3% slower
> pf0: 6447.07 ( 25.19)  18.7% faster
> pf1: 1876.80 ( 47.13)  76.6% faster
> </quote>
>
> IIUC, pf0 and pf1 are the same test case (all 4 foreign tables target the
> same server) measured with different implementations of the patch.

pf0 is measured on partitioned (sharded) tables on one foreign server, that is, sharing a connection. pf1 is, in contrast, on sharded tables each of which has a dedicated server (or connection). The parent server is async-patched and the child server is not patched.

An async-capable plan is generated in the planner. An Append that contains at least one async-capable child becomes an async-aware Append. So the async feature should be effective also for the UNION ALL case.

The following will work faster than on the unpatched version.

SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20 UNION ALL SELECT a FROM ft30 UNION ALL SELECT a FROM ft40) as ft;

I'll measure the performance for this case next week.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. This is the final report in this CF period.

At Fri, 17 Mar 2017 17:35:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170317.173505.152063931.horiguchi.kyotaro@lab.ntt.co.jp>
> An async-capable plan is generated in the planner. An Append that contains at
> least one async-capable child becomes an async-aware Append. So the
> async feature should be effective also for the UNION ALL case.
>
> The following will work faster than on the unpatched version.
>
> SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20 UNION ALL SELECT a FROM ft30 UNION ALL SELECT a FROM ft40) as ft;
>
> I'll measure the performance for this case next week.

I found that the following query works the same as the partitioned-table case.

SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20 UNION ALL SELECT a FROM ft30 UNION ALL SELECT a FROM ft40 UNION ALL *SELECT a FROM ONLY pf0*) as ft;

So the difference comes from the additional async-incapable child (the query is faster if it contains any). In both cases the Append node runs its children asynchronously, but slightly differently when all async-capable children are busy.

I'll continue working on this from this point, aiming at the next commit fest. Thank you for the valuable feedback.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
> I'll continue working on this from this point, aiming at the next
> commit fest.
This probably will not surprise you given the many commits in the past 2 weeks, but the patches no longer apply to master:
$ git apply ~/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:27: trailing whitespace.
FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:39: trailing whitespace.
#include "utils/resowner_private.h"
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:47: trailing whitespace.
ResourceOwner resowner; /* Resource owner */
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:48: trailing whitespace.
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:57: trailing whitespace.
WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
error: patch failed: src/backend/libpq/pqcomm.c:201
error: src/backend/libpq/pqcomm.c: patch does not apply
error: patch failed: src/backend/storage/ipc/latch.c:61
error: src/backend/storage/ipc/latch.c: patch does not apply
error: patch failed: src/backend/storage/lmgr/condition_variable.c:66
error: src/backend/storage/lmgr/condition_variable.c: patch does not apply
error: patch failed: src/backend/utils/resowner/resowner.c:124
error: src/backend/utils/resowner/resowner.c: patch does not apply
error: patch failed: src/include/storage/latch.h:101
error: src/include/storage/latch.h: patch does not apply
error: patch failed: src/include/utils/resowner_private.h:18
error: src/include/utils/resowner_private.h: patch does not apply
Hello,

At Sun, 2 Apr 2017 12:21:14 -0400, Corey Huinker <corey.huinker@gmail.com> wrote in <CADkLM=dN_vt8kazOoiVOfjN6xFHpzf5uiGJz+iN+f4fLbYwSKA@mail.gmail.com>
> > I'll continue working on this from this point, aiming at the next
> > commit fest.
>
> This probably will not surprise you given the many commits in the past 2
> weeks, but the patches no longer apply to master:

Yeah, I'm not surprised by that, but thank you for letting me know. It greatly reduces the difficulty of merging. Thank you.

> $ git apply
> ~/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch
> /home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:27:
> trailing whitespace.

Maybe the patch was retrieved on Windows and then transferred to a Linux box. Converting the EOLs of the files or some git configuration might fix that. (git am has --no-keep-cr, but I haven't found an equivalent for git apply.)

The attached patch is rebased on the current master, but no substantial changes other than disallowing partitioned tables on async by assertion.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. At Tue, 04 Apr 2017 19:25:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170404.192539.29699823.horiguchi.kyotaro@lab.ntt.co.jp> > The attached patch is rebased on the current master, but no > substantial changes other than disallowing partitioned tables on > async by assertion. This is just rebased onto the current master (d761fe2). I'll recheck further detail after this. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
At Mon, 22 May 2017 13:12:14 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170522.131214.20936668.horiguchi.kyotaro@lab.ntt.co.jp> > > The attached patch is rebased on the current master, but no > > substantial changes other than disallowing partitioned tables on > > async by assertion. > > This is just rebased onto the current master (d761fe2). > I'll recheck further detail after this. Sorry, the patch was missing some files to add. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
The patch got conflicted. This is a new version just rebased to the current master. Further amendment will be taken later.

> The attached patch is rebased on the current master, but no
> substantial changes other than disallowing partitioned tables on
> async by assertion.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> The patch got conflicted. This is a new version just rebased to
> the current master. Further amendment will be taken later.

Can you please explain this part of make_append() ?

	/* Currently async on partitioned tables is not available */
	Assert(nasyncplans == 0 || partitioned_rels == NIL);

I don't think the output of Append plan is supposed to be ordered even if the underlying relation is partitioned. Besides ordering, is there any other reason not to use the asynchronous execution?

And even if there was some, the planner should ensure that executor does not fire the assertion statement above. The script attached shows an example how to cause the assertion failure.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
Attachment
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> The patch got conflicted. This is a new version just rebased to
> the current master. Further amendment will be taken later.

Just one idea that I had while reading the code.

In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the complete requests to the end and finally adjust estate->es_num_pending_async so that the array no longer contains the complete requests. I think the point is that then you can add new requests to the end of the array.

I wonder if a set (Bitmapset) of incomplete requests would make the code more efficient. The set would contain the position of each incomplete request in estate->es_pending_async (I think it's the myindex field of PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the requests subject to ExecAsyncNotify etc., then the compaction of estate->es_pending_async wouldn't be necessary.

ExecAsyncRequest would use the set to look for space for new requests by iterating it and trying to find the first gap (which corresponds to a completed request).

And finally, the item would be removed from the set at the moment the request state is being set to ASYNCREQ_COMPLETE.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
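A rough sketch of that idea, assuming a new EState field (es_incomplete_async below, which does not exist in the posted patch) holding the Bitmapset:

	/* when a request is issued: record its slot as incomplete */
	estate->es_incomplete_async =
		bms_add_member(estate->es_incomplete_async, areq->myindex);

	/* in the event loop: visit only incomplete requests, no compaction */
	int			idx = -1;

	while ((idx = bms_next_member(estate->es_incomplete_async, idx)) >= 0)
	{
		PendingAsyncRequest *areq = estate->es_pending_async[idx];

		/* ... ExecAsyncNotify(areq) etc. ... */
	}

	/* when a request completes: free the slot for reuse */
	estate->es_incomplete_async =
		bms_del_member(estate->es_incomplete_async, areq->myindex);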
Thank you for looking at this.

At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska <ah@cybertec.at> wrote in <4579.1498638234@localhost>
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> > The patch got conflicted. This is a new version just rebased to
> > the current master. Further amendment will be taken later.
>
> Can you please explain this part of make_append() ?
>
> 	/* Currently async on partitioned tables is not available */
> 	Assert(nasyncplans == 0 || partitioned_rels == NIL);
>
> I don't think the output of Append plan is supposed to be ordered even if the
> underlying relation is partitioned. Besides ordering, is there any other
> reason not to use the asynchronous execution?

It was just a developmental sentinel that will remind me later to consider the declarative partitions, since I didn't have an idea of the differences (or the similarity) between appendrels and partitioned_rels. It is not to say that the condition cannot occur. I'll check it out and will support partitioned_rels soon. Sorry for having left it as it is.

> And even if there was some, the planner should ensure that executor does not
> fire the assertion statement above. The script attached shows an example how
> to cause the assertion failure.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,

On 2017/06/29 13:45, Kyotaro HORIGUCHI wrote:
> Thank you for looking at this.
>
> At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska wrote:
>> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>
>>> The patch got conflicted. This is a new version just rebased to
>>> the current master. Further amendment will be taken later.
>>
>> Can you please explain this part of make_append() ?
>>
>> 	/* Currently async on partitioned tables is not available */
>> 	Assert(nasyncplans == 0 || partitioned_rels == NIL);
>>
>> I don't think the output of Append plan is supposed to be ordered even if the
>> underlying relation is partitioned. Besides ordering, is there any other
>> reason not to use the asynchronous execution?
>
> It was just a developmental sentinel that will remind me later to
> consider the declarative partitions, since I didn't have an idea
> of the differences (or the similarity) between appendrels and
> partitioned_rels. It is not to say that the condition cannot occur.
> I'll check it out and will support partitioned_rels soon.
> Sorry for having left it as it is.

When making an Append for a partitioned table, among the arguments passed to make_append(), 'partitioned_rels' is a list of RT indexes of partitioned tables in the inheritance tree of which the aforementioned partitioned table is the root. 'appendplans' is a list of subplans for scanning the leaf partitions in the tree. Note that the 'appendplans' list contains no members corresponding to the partitioned tables, because we don't need to scan them (only leaf relations contain any data).

The point of having the 'partitioned_rels' list in the resulting Append plan is so that the executor can identify those relations and take the appropriate locks on them.

Thanks,
Amit
Hi, I've returned.

At Thu, 29 Jun 2017 14:08:27 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <63a5a01c-2967-83e0-8bbf-c981404f529e@lab.ntt.co.jp>
> Hi,
>
> On 2017/06/29 13:45, Kyotaro HORIGUCHI wrote:
> > Thank you for looking at this.
> >
> > At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska wrote:
> >> Can you please explain this part of make_append() ?
> >>
> >> 	/* Currently async on partitioned tables is not available */
> >> 	Assert(nasyncplans == 0 || partitioned_rels == NIL);
> >>
> >> I don't think the output of Append plan is supposed to be ordered even if the
> >> underlying relation is partitioned. Besides ordering, is there any other
> >> reason not to use the asynchronous execution?
>
> When making an Append for a partitioned table, among the arguments passed
> to make_append(), 'partitioned_rels' is a list of RT indexes of
> partitioned tables in the inheritance tree of which the aforementioned
> partitioned table is the root. 'appendplans' is a list of subplans for
> scanning the leaf partitions in the tree. Note that the 'appendplans'
> list contains no members corresponding to the partitioned tables, because
> we don't need to scan them (only leaf relations contain any data).
>
> The point of having the 'partitioned_rels' list in the resulting Append
> plan is so that the executor can identify those relations and take the
> appropriate locks on them.

Amit, thank you for the detailed explanation. I understand what it is, and that just ignoring it is enough, and I confirmed that it actually works as before. I'll address Antonin's comments tomorrow.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Thank you for the thought. This is at PoC level, so I'd be grateful for this kind of fundamental comment.

At Wed, 28 Jun 2017 20:22:24 +0200, Antonin Houska <ah@cybertec.at> wrote in <392.1498674144@localhost>
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> > The patch got conflicted. This is a new version just rebased to
> > the current master. Further amendment will be taken later.
>
> Just one idea that I had while reading the code.
>
> In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the
> complete requests to the end and finally adjust estate->es_num_pending_async so
> that the array no longer contains the complete requests. I think the point is
> that then you can add new requests to the end of the array.
>
> I wonder if a set (Bitmapset) of incomplete requests would make the code more
> efficient. The set would contain the position of each incomplete request in
> estate->es_pending_async (I think it's the myindex field of
> PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
> requests subject to ExecAsyncNotify etc., then the compaction of
> estate->es_pending_async wouldn't be necessary.
>
> ExecAsyncRequest would use the set to look for space for new requests by
> iterating it and trying to find the first gap (which corresponds to a completed
> request).
>
> And finally, the item would be removed from the set at the moment the request
> state is being set to ASYNCREQ_COMPLETE.

Effectively it is a waiting queue followed by a completed list. The point of the compaction is keeping the order of waiting or not-yet-completed requests, which is crucial to avoid a kind of precedence inversion. We cannot keep the order by using a bitmapset in that way.

The current code waits for all waiters at once and processes all fired events at once. The order in the waiting queue is inessential in that case. On the other hand, I suppose waiting on several tens to near a hundred remote hosts is in a realistic target range. Keeping the order could be crucial if we process only a part of the queue at once in that case.

Putting significance on the deviation of response time of remotes, process-all-at-once is effective. In turn, we should consider the effectiveness of the lifecycle of the larger wait event set.

Sorry for the discursive discussion, but in short, I have noticed that I have a lot to consider on this :p

Thanks!

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> Effectively it is a waiting queue followed by a completed list. The point of
> the compaction is keeping the order of waiting or not-yet-completed requests,
> which is crucial to avoid a kind of precedence inversion. We cannot keep the
> order by using a bitmapset in that way.
>
> The current code waits for all waiters at once and processes all fired events
> at once. The order in the waiting queue is inessential in that case. On the
> other hand, I suppose waiting on several tens to near a hundred remote hosts
> is in a realistic target range. Keeping the order could be crucial if we
> process only a part of the queue at once in that case.
>
> Putting significance on the deviation of response time of remotes,
> process-all-at-once is effective. In turn, we should consider the
> effectiveness of the lifecycle of the larger wait event set.

ok, I missed the fact that the order of es_pending_async entries is important. I think this is worth adding a comment.

Actually the reason I thought of simplification was that I noticed a small inefficiency in the way you do the compaction. In particular, I think it's not always necessary to swap the tail and head entries. Would something like this make sense?

	/* If any node completed, compact the array. */
	if (any_node_done)
	{
		int		hidx = 0,
				tidx;

		/*
		 * Swap all not-yet-completed items to the start of the array.
		 * Keep them in the same order.
		 */
		for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
		{
			PendingAsyncRequest *tail = estate->es_pending_async[tidx];

			Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);

			if (tail->state == ASYNCREQ_COMPLETE)
				continue;

			/*
			 * If the array starts with one or more incomplete requests,
			 * both head and tail point at the same item, so there's no
			 * point in swapping.
			 */
			if (tidx > hidx)
			{
				PendingAsyncRequest *head = estate->es_pending_async[hidx];

				/*
				 * Once the tail got ahead, it should only leave
				 * ASYNCREQ_COMPLETE behind. Only those can then be seen
				 * by head.
				 */
				Assert(head->state == ASYNCREQ_COMPLETE);
				estate->es_pending_async[tidx] = head;
				estate->es_pending_async[hidx] = tail;
			}

			++hidx;
		}
		estate->es_num_pending_async = hidx;
	}

And besides that, I think it'd be more intuitive if the meaning of "head" and "tail" was reversed: if the array is iterated from lower to higher positions, then I'd consider head to be at higher position, not tail.
-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
Hello,

At Tue, 11 Jul 2017 10:28:51 +0200, Antonin Houska <ah@cybertec.at> wrote in <6448.1499761731@localhost>
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Effectively it is a waiting queue followed by a completed list. The point of
> > the compaction is keeping the order of waiting or not-yet-completed requests,
> > which is crucial to avoid a kind of precedence inversion. We cannot keep the
> > order by using a bitmapset in that way.
>
> > The current code waits for all waiters at once and processes all fired events
> > at once. The order in the waiting queue is inessential in that case. On the
> > other hand, I suppose waiting on several tens to near a hundred remote hosts
> > is in a realistic target range. Keeping the order could be crucial if we
> > process only a part of the queue at once in that case.
> >
> > Putting significance on the deviation of response time of remotes,
> > process-all-at-once is effective. In turn, we should consider the
> > effectiveness of the lifecycle of the larger wait event set.
>
> ok, I missed the fact that the order of es_pending_async entries is
> important. I think this is worth adding a comment.

I'll put an upper limit on the number of waiters processed at once, then add a comment like that.

> Actually the reason I thought of simplification was that I noticed a small
> inefficiency in the way you do the compaction. In particular, I think it's not
> always necessary to swap the tail and head entries. Would something like this
> make sense?

I'm not sure, but I suppose that it is rare that all of the first many elements in the array are not COMPLETE. In most cases the first element gets a response first.

> 	/* If any node completed, compact the array. */
> 	if (any_node_done)
> 	{
...
> 		for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
> 		{
...
> 			if (tail->state == ASYNCREQ_COMPLETE)
> 				continue;
>
> 			/*
> 			 * If the array starts with one or more incomplete requests,
> 			 * both head and tail point at the same item, so there's no
> 			 * point in swapping.
> 			 */
> 			if (tidx > hidx)
> 			{

This works to skip swapping for the first several elements when none of them is ASYNCREQ_COMPLETE. I think it makes sense as long as it doesn't harm the loop. The optimization is more effective when hoisted out of the loop like this:

| for (tidx = 0; tidx < estate->es_num_pending_async &&
|      estate->es_pending_async[tidx]->state != ASYNCREQ_COMPLETE; ++tidx)
| 	;
| for (; tidx < estate->es_num_pending_async; ++tidx)
...

> And besides that, I think it'd be more intuitive if the meaning of "head" and
> "tail" was reversed: if the array is iterated from lower to higher positions,
> then I'd consider head to be at higher position, not tail.

Yeah, but maybe the "head" is still confusing even if reversed, because it is still not a head of anything. It might be less confusing to rewrite it in a more verbose-but-straightforward way.
| 	int			npending = 0;
|
| 	/* Skip over not-completed items at the beginning */
| 	while (npending < estate->es_num_pending_async &&
| 		   estate->es_pending_async[npending]->state != ASYNCREQ_COMPLETE)
| 		npending++;
|
| 	/* Scan over the rest for not-completed items */
| 	for (i = npending + 1; i < estate->es_num_pending_async; ++i)
| 	{
| 		PendingAsyncRequest *tmp;
| 		PendingAsyncRequest *curr = estate->es_pending_async[i];
|
| 		if (curr->state == ASYNCREQ_COMPLETE)
| 			continue;
|
| 		/* Swap the not-completed item into the tail of the first chunk */
| 		tmp = estate->es_pending_async[npending];
| 		estate->es_pending_async[npending] = curr;
| 		estate->es_pending_async[i] = tmp;
| 		++npending;
| 	}

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hello, 8bf58c0d9bd33686 badly conflicts with this patch, so I'll rebase it and add a patch to refactor the function that Antonin pointed out. That would be merged into the 0002 patch.

At Tue, 18 Jul 2017 16:24:52 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170718.162452.221576658.horiguchi.kyotaro@lab.ntt.co.jp>
> I'll put an upper limit on the number of waiters processed at
> once, then add a comment like that.
>
> > Actually the reason I thought of simplification was that I noticed a small
> > inefficiency in the way you do the compaction. In particular, I think it's not
> > always necessary to swap the tail and head entries. Would something like this
> > make sense?
>
> I'm not sure, but I suppose that it is rare that all of the first
> many elements in the array are not COMPLETE. In most cases the
> first element gets a response first.
...
> Yeah, but maybe the "head" is still confusing even if reversed,
> because it is still not a head of anything. It might be less
> confusing to rewrite it in a more verbose-but-straightforward way.
>
> | 	int			npending = 0;
> |
> | 	/* Skip over not-completed items at the beginning */
> | 	while (npending < estate->es_num_pending_async &&
> | 		   estate->es_pending_async[npending]->state != ASYNCREQ_COMPLETE)
> | 		npending++;
> |
> | 	/* Scan over the rest for not-completed items */
> | 	for (i = npending + 1; i < estate->es_num_pending_async; ++i)
> | 	{
> | 		PendingAsyncRequest *tmp;
> | 		PendingAsyncRequest *curr = estate->es_pending_async[i];
> |
> | 		if (curr->state == ASYNCREQ_COMPLETE)
> | 			continue;
> |
> | 		/* Swap the not-completed item into the tail of the first chunk */
> | 		tmp = estate->es_pending_async[npending];
> | 		estate->es_pending_async[npending] = curr;
> | 		estate->es_pending_async[i] = tmp;
> | 		++npending;
> | 	}

The last patch does something like this (with apparent bugs fixed).

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Jul 25, 2017 at 5:11 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> [ new patches ]

I spent some time today refreshing my memory of what's going on with this thread.

Ostensibly, the advantage of this framework over my previous proposal is that it avoids inserting anything into ExecProcNode(), which is probably a good thing to avoid given how frequently ExecProcNode() is called. Unless the parent and the child both know about asynchronous execution and choose to use it, everything runs exactly as it does today and so there is no possibility of a complaint about a performance hit. As far as it goes, that is good.

However, at a deeper level, I fear we haven't really solved the problem. If an Append is directly on top of a ForeignScan node, then this will work. But if an Append is indirectly on top of a ForeignScan node with some other stuff in the middle, then it won't - unless we make whichever nodes appear between the Append and the ForeignScan async-capable. Indeed, we'd really want all kinds of joins and aggregates to be async-capable so that examples like the one Corey asked about in http://postgr.es/m/CADkLM=fuvVdKvz92XpCRnb4=rj6bLOhSLifQ3RV=Sb4Q5rJsRA@mail.gmail.com will work. But if we do, then I fear we'll just be reintroducing the same performance regression that we introduced by switching to this framework from the previous one - or maybe a different one, but a regression all the same. Every type of intermediate node will have to have a code path where it uses ExecAsyncRequest() / ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and it seems like that will either end up duplicating a lot of code from the regular code path or, alternatively, polluting the regular code path with some of the async code's concerns to avoid duplication, and maybe slowing things down.

Maybe that concern is unjustified; I'm not sure. Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
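To make the duplication concern concrete, here is roughly what each intermediate node would face; the ExecAsyncRequest signature below is a guess based on the function names mentioned in this thread, not the patch's actual interface.

	/* Synchronous path: a plain, recursive demand-pull call. */
	slot = ExecProcNode(outerPlanState(node));

	/*
	 * Asynchronous path: issue a request and return to the event loop;
	 * the tuple arrives later via a response callback.  Remembering why
	 * we asked and how to resume is the state every node type would
	 * have to carry alongside its normal code path.
	 */
	ExecAsyncRequest(estate, &node->ps, outerPlanState(node));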
Robert Haas <robertmhaas@gmail.com> writes: > Ostensibly, the advantage of this framework over my previous proposal > is that it avoids inserting anything into ExecProcNode(), which is > probably a good thing to avoid given how frequently ExecProcNode() is > called. Unless the parent and the child both know about asynchronous > execution and choose to use it, everything runs exactly as it does > today and so there is no possibility of a complaint about a > performance hit. As far as it goes, that is good. > However, at a deeper level, I fear we haven't really solved the > problem. If an Append is directly on top of a ForeignScan node, then > this will work. But if an Append is indirectly on top of a > ForeignScan node with some other stuff in the middle, then it won't - > unless we make whichever nodes appear between the Append and the > ForeignScan async-capable. I have not been paying any attention to this thread whatsoever, but I wonder if you can address your problem by building on top of the ExecProcNode replacement that Andres is working on, https://www.postgresql.org/message-id/20170726012641.bmhfcp5ajpudihl6@alap3.anarazel.de The scheme he has allows $extra_stuff to be injected into ExecProcNode at no cost when $extra_stuff is not needed, because you simply don't insert the wrapper function when it's not needed. I'm not sure that it will scale well to several different kinds of insertions though, for instance if you wanted both instrumentation and async support on the same node. But maybe those two couldn't be arms-length from each other anyway, in which case it might be fine as-is. regards, tom lane
On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I have not been paying any attention to this thread whatsoever, > but I wonder if you can address your problem by building on top of > the ExecProcNode replacement that Andres is working on, > https://www.postgresql.org/message-id/20170726012641.bmhfcp5ajpudihl6@alap3.anarazel.de > > The scheme he has allows $extra_stuff to be injected into ExecProcNode at > no cost when $extra_stuff is not needed, because you simply don't insert > the wrapper function when it's not needed. I'm not sure that it will > scale well to several different kinds of insertions though, for instance > if you wanted both instrumentation and async support on the same node. > But maybe those two couldn't be arms-length from each other anyway, > in which case it might be fine as-is. Yeah, I don't quite see how that would apply in this case -- what we need here is not as simple as just conditionally injecting an extra bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Thank you for the comment.

At Wed, 26 Jul 2017 17:16:43 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYrbgTBnLwnr1v=pk+C=znWg7AgV9=M9ehrq6TDexPQNw@mail.gmail.com>
> But if we do, then I fear we'll just be reintroducing the same
> performance regression that we introduced by switching to this
> framework from the previous one - or maybe a different one, but a
> regression all the same. Every type of intermediate node will have to
> have a code path where it uses ExecAsyncRequest() /
> ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and

I understand Robert's concern and I think I share the same opinion. This needs a further, different framework.

At Thu, 27 Jul 2017 06:39:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmoa=ke_zfucOAa3YEUnBSC=FSXn8SU2aYc8PGBBp=Yy9fw@mail.gmail.com>
> On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I have not been paying any attention to this thread whatsoever,
> > but I wonder if you can address your problem by building on top of
> > the ExecProcNode replacement that Andres is working on,
> > https://www.postgresql.org/message-id/20170726012641.bmhfcp5ajpudihl6@alap3.anarazel.de
> >
> > The scheme he has allows $extra_stuff to be injected into ExecProcNode at
> > no cost when $extra_stuff is not needed, because you simply don't insert
> > the wrapper function when it's not needed. I'm not sure that it will
> > scale well to several different kinds of insertions though, for instance
> > if you wanted both instrumentation and async support on the same node.
> > But maybe those two couldn't be arms-length from each other anyway,
> > in which case it might be fine as-is.
>
> Yeah, I don't quite see how that would apply in this case -- what we
> need here is not as simple as just conditionally injecting an extra
> bit.

Thank you for the pointer, Tom. The subject (segfault in HEAD...) hadn't made me think that this kind of discussion was being held there. Anyway, it seems very close to asynchronous execution, so I'll catch up on it, considering how I can tie in with it.

Regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 28 Jul 2017 17:31:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170728.173105.238045591.horiguchi.kyotaro@lab.ntt.co.jp>
> Thank you for the comment.
>
> At Wed, 26 Jul 2017 17:16:43 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYrbgTBnLwnr1v=pk+C=znWg7AgV9=M9ehrq6TDexPQNw@mail.gmail.com>
> > regression all the same. Every type of intermediate node will have to
> > have a code path where it uses ExecAsyncRequest() /
> > ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and
>
> I understand Robert's concern and I think I share the same
> opinion. This needs a further, different framework.
>
> At Thu, 27 Jul 2017 06:39:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmoa=ke_zfucOAa3YEUnBSC=FSXn8SU2aYc8PGBBp=Yy9fw@mail.gmail.com>
> > On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > The scheme he has allows $extra_stuff to be injected into ExecProcNode at
> > > no cost when $extra_stuff is not needed, because you simply don't insert
> > > the wrapper function when it's not needed. I'm not sure that it will
...
> > Yeah, I don't quite see how that would apply in this case -- what we
> > need here is not as simple as just conditionally injecting an extra
> > bit.
>
> Thank you for the pointer, Tom. The subject (segfault in HEAD...)
> hadn't made me think that this kind of discussion was being held
> there. Anyway, it seems very close to asynchronous execution, so
> I'll catch up on it, considering how I can tie in with it.

I now understand the executor change that has just been made on master based on the pointed-to thread. It seems to have the capability to let an exec node switch to async-aware mode at no extra cost for non-async processing. So it would be doable to (just) *shrink* the current framework by detaching the async-aware side of the API.

But to get the most out of asynchrony, multiple async-capable nodes distributed under async-unaware nodes must run simultaneously. There seem to be two ways to achieve this.

One is propagating a required-async-nodes bitmap up to the topmost node and waiting for all the required nodes to become ready. In the long run this requires all nodes to be async-aware, and that would apparently have a bad performance effect on async-unaware nodes containing async-capable nodes.

Another is getting rid of recursive call to run an execution tree. It is perhaps the same as what was mentioned as "data-centric processing" in previous threads [1], [2], but I'd like to pay attention to the aspect of "enabling execution of a tree to resume from an arbitrary leaf node". So I'm considering realizing it still in a one-tuple-by-one manner instead of collecting all tuples of a leaf node first, even though I'm not sure it is doable.

[1] https://www.postgresql.org/message-id/BF2827DCCE55594C8D7A8F7FFD3AB77159A9B904@szxeml521-mbs.china.huawei.com
[2] https://www.postgresql.org/message-id/20160629183254.frcm3dgg54ud5m6o@alap3.anarazel.de

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
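For reference, the committed change works roughly as sketched below (simplified from the actual code): ExecProcNode becomes a call through a per-node function pointer, so a wrapper is paid for only by the nodes that need one. That is what makes a "shrunk" async-aware variant conceivable at no cost to ordinary nodes.

	/* Callers always go through the node's function pointer. */
	static inline TupleTableSlot *
	ExecProcNode(PlanState *node)
	{
		return node->ExecProcNode(node);
	}

	/*
	 * At node initialization, a wrapper is installed only when needed,
	 * so the plain path pays nothing.  An async-aware wrapper could be
	 * hung here the same way instrumentation is.
	 */
	node->ExecProcNodeReal = ExecSeqScan;
	node->ExecProcNode = node->instrument ? ExecProcNodeInstr
										  : node->ExecProcNodeReal;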
On Mon, Jul 31, 2017 at 5:42 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Another is getting rid of recursive call to run an execution > tree. That happens to be exactly what Andres did for expression evaluation in commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755, and I think generalizing that to include the plan tree as well as expression trees is likely to be the long-term way forward here. Unfortunately, that's probably another gigantic patch (that should probably be written by Andres). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Thank you for the comment.

At Tue, 1 Aug 2017 16:27:41 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmobbZrBPb7cvFj3ACPX2A_qSEB4ughRmB5dkGPXUYx_E+Q@mail.gmail.com>
> On Mon, Jul 31, 2017 at 5:42 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Another is getting rid of recursive call to run an execution
> > tree.
>
> That happens to be exactly what Andres did for expression evaluation
> in commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755, and I think
> generalizing that to include the plan tree as well as expression trees
> is likely to be the long-term way forward here.

I read it in the source tree. The patch implements converting an expression tree into an intermediate form and then running it on a custom-made interpreter. Guessing from the phrase "upside down" from Andres, the whole thing will become source-driven.

> Unfortunately, that's probably another gigantic patch (that
> should probably be written by Andres).

Yeah, but an async executor on the current style of executor seems futile work, and sitting idle until the patch comes is also a waste of time. So I'm planning to include the following stuff in the next PoC patch, even though I'm not sure it can land on the coming Andres' patch.

- Tuple passing outside the call stack. (I remember it came up
  earlier in this thread but I couldn't find it.)

  This should be included in Andres' patch.

- Give the executor an ability to run from data-source (or driver)
  nodes up to the root.

  I'm not sure this is included, but I suppose he is aiming at this
  kind of thing.

- Rebuild asynchronous execution on the upside-down executor.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 03 Aug 2017 09:30:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170803.093057.261590619.horiguchi.kyotaro@lab.ntt.co.jp>
> > Unfortunately, that's probably another gigantic patch (that
> > should probably be written by Andres).
>
> Yeah, but an async executor on the current style of executor seems
> futile work, and sitting idle until the patch comes is also a waste of
> time. So I'm planning to include the following stuff in the next
> PoC patch, even though I'm not sure it can land on the coming
> Andres' patch.
>
> - Tuple passing outside the call stack. (I remember it came up
>   earlier in this thread but I couldn't find it.)
>
>   This should be included in Andres' patch.
>
> - Give the executor an ability to run from data-source (or driver)
>   nodes up to the root.
>
>   I'm not sure this is included, but I suppose he is aiming at this
>   kind of thing.
>
> - Rebuild asynchronous execution on the upside-down executor.

Anyway, I modified ExecProcNode into a push-up form and it *seems* to work to some extent. But triggers and cursors are almost broken, and several other regression tests fail. Some nodes, such as windowagg, are terribly difficult to change to this push-up form (using a state machine). And of course it is terribly inefficient.

I'm afraid that all of this will turn out to be in vain. But anyway, and FWIW, I'll post the work here after some cleanup.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 31 Aug 2017 21:52:36 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170831.215236.135328985.horiguchi.kyotaro@lab.ntt.co.jp>
> At Thu, 03 Aug 2017 09:30:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170803.093057.261590619.horiguchi.kyotaro@lab.ntt.co.jp>
> > > Unfortunately, that's probably another gigantic patch (that
> > > should probably be written by Andres).
> >
> > Yeah, but an async executor on the current style of executor seems
> > futile work, and sitting idle until the patch comes is also a waste of
> > time. So I'm planning to include the following stuff in the next
> > PoC patch, even though I'm not sure it can land on the coming
> > Andres' patch.
> >
> > - Tuple passing outside the call stack. (I remember it came up
> >   earlier in this thread but I couldn't find it.)
> >
> >   This should be included in Andres' patch.
> >
> > - Give the executor an ability to run from data-source (or driver)
> >   nodes up to the root.
> >
> >   I'm not sure this is included, but I suppose he is aiming at this
> >   kind of thing.
> >
> > - Rebuild asynchronous execution on the upside-down executor.
>
> Anyway, I modified ExecProcNode into a push-up form and it *seems*
> to work to some extent. But triggers and cursors are almost broken,
> and several other regression tests fail. Some nodes, such as
> windowagg, are terribly difficult to change to this push-up form
> (using a state machine). And of course it is terribly inefficient.
>
> I'm afraid that all of this will turn out to be in vain. But anyway,
> and FWIW, I'll post the work here after some cleanup.

So, here it is. Maybe this is really a bad way to go. The worst part is that it's terribly hard to maintain, because the behavior of the state machine constructed in this patch is hardly predictable, so it is easily broken. During the cleanup work I hit many crashes and infinite loops, and they were a bit hard to diagnose. This will soon be broken by subsequent commits.

Anyway, and again FWIW, this is it. I'll leave this for a while (at least for the period of this CF) and reconsider async execution in different forms.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. A fully-asynchronous executor needs every node to be stateful and suspendable at the time it requests the next tuples from the nodes underneath. I tried a pure push-based executor but failed.

After the miserable patch upthread, I finally managed to make executor nodes suspendable using computational jumps and got rid of the recursive calls in the executor. But it runs about 10x slower for the simple SeqScan case (pgbench ran with 9% degradation), and that doesn't seem recoverable by handy improvements. So I gave up on that.

Then I returned to single-level asynchrony, in other words, the simple case with async-aware nodes just above async-capable nodes. The motivation for using the framework in the previous patch was that we had degradation on the sync (or normal) code paths from polluting ExecProcNode with async stuff; per Tom's suggestion, the node->ExecProcNode trick can isolate the async code path. The attached PoC patch theoretically has no impact on the normal code paths and just brings a gain in async cases. (Additional members in PlanState cause a degradation that seemingly comes from alignment, though.)

But I haven't had stable enough results from the performance test. Different builds from the same source code give apparently different results... Anyway, I'll show the best one of several runs here.

                            original(ms)  patched(ms)  gain(%)
 A: simple table scan     :     9714.70      9656.73      0.6
 B: local partitioning    :     4119.44      4131.10     -0.3
 C: single remote table   :     9484.86      9141.89      3.7
 D: sharding (single con) :     7114.34      6751.21      5.1
 E: sharding (multi con)  :     7166.56      1827.93     74.5

A and B are degradation checks, which are expected to show no degradation. C is the gain just from postgres_fdw's command presending on a remote table. D is the gain of sharding on a single connection. The number of partitions/shards is 4. E is the gain using a dedicated connection per shard.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From fc424c16e124934581a184fcadaed1e05f7672c8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

WaitEventSet needs to be released using a resource owner for a certain
case. This change adds WaitEventSet reowner and allows the creator of a
WaitEventSet to specify a resource owner.
---src/backend/libpq/pqcomm.c | 2 +-src/backend/storage/ipc/latch.c | 18 ++++++-src/backend/storage/lmgr/condition_variable.c| 2 +-src/backend/utils/resowner/resowner.c | 68 +++++++++++++++++++++++++++src/include/storage/latch.h | 4 +-src/include/utils/resowner_private.h | 8 ++++6 files changed, 97 insertions(+), 5 deletions(-) diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 754154b..d459f32 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -220,7 +220,7 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m")));#endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3); + FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3); AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE,MyProcPort->sock, NULL, NULL); AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1,MyLatch, NULL); diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 4eb6e83..e6fc3dd 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -51,6 +51,7 @@#include "storage/latch.h"#include "storage/pmsignal.h"#include "storage/shmem.h" +#include "utils/resowner_private.h"/* * Select the fd readiness primitive to use. Normally the "most modern" @@ -77,6 +78,8 @@ struct WaitEventSet int nevents; /* number of registered events */ int nevents_space; /* maximum number of events in this set */ + ResourceOwner resowner; /* Resource owner */ + /* * Array, of nevents_space length, storing the definition of events this * set is waiting for. @@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, int ret = 0; int rc; WaitEvent event; - WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3); + WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3); if (wakeEvents & WL_TIMEOUT) Assert(timeout>= 0); @@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch) * WaitEventSetWait(). */WaitEventSet * -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents){ WaitEventSet *set; char *data; Size sz = 0; + if (res) + ResourceOwnerEnlargeWESs(res); + /* * Use MAXALIGN size/alignment to guarantee that later uses of memory are * aligned correctly. E.g. epoll_eventmight need 8 byte alignment on some @@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents) StaticAssertStmt(WSA_INVALID_EVENT == NULL,"");#endif + /* Register this wait event set if requested */ + set->resowner = res; + if (res) + ResourceOwnerRememberWES(set->resowner, set); + return set;} @@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set) }#endif + if (set->resowner != NULL) + ResourceOwnerForgetWES(set->resowner, set); + pfree(set);} diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c index b4b7d28..182f759 100644 --- a/src/backend/storage/lmgr/condition_variable.c +++ b/src/backend/storage/lmgr/condition_variable.c @@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv) /* Create a reusable WaitEventSet. 
*/ if (cv_wait_event_set== NULL) { - cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1); + cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1); AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET,PGINVALID_SOCKET, MyLatch, NULL); } diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c index bd19fad..d36481e 100644 --- a/src/backend/utils/resowner/resowner.c +++ b/src/backend/utils/resowner/resowner.c @@ -124,6 +124,7 @@ typedef struct ResourceOwnerData ResourceArray snapshotarr; /* snapshot references */ ResourceArrayfilearr; /* open temporary files */ ResourceArray dsmarr; /* dynamic shmem segments */ + ResourceArray wesarr; /* wait event sets */ /* We can remember up to MAX_RESOWNER_LOCKS references to locallocks. */ int nlocks; /* number of owned locks */ @@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);static void PrintSnapshotLeakWarning(Snapshotsnapshot);static void PrintFileLeakWarning(File file);static void PrintDSMLeakWarning(dsm_segment*seg); +static void PrintWESLeakWarning(WaitEventSet *events);/***************************************************************************** @@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name) ResourceArrayInit(&(owner->snapshotarr),PointerGetDatum(NULL)); ResourceArrayInit(&(owner->filearr), FileGetDatum(-1)); ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL)); + ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL)); return owner;} @@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner, PrintDSMLeakWarning(res); dsm_detach(res); } + + /* Ditto for wait event sets */ + while (ResourceArrayGetAny(&(owner->wesarr), &foundres)) + { + WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres); + + if (isCommit) + PrintWESLeakWarning(event); + FreeWaitEventSet(event); + } } else if (phase == RESOURCE_RELEASE_LOCKS) { @@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner) Assert(owner->snapshotarr.nitems == 0); Assert(owner->filearr.nitems== 0); Assert(owner->dsmarr.nitems == 0); + Assert(owner->wesarr.nitems == 0); Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1); /* @@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner) ResourceArrayFree(&(owner->snapshotarr)); ResourceArrayFree(&(owner->filearr)); ResourceArrayFree(&(owner->dsmarr)); + ResourceArrayFree(&(owner->wesarr)); pfree(owner);} @@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg) elog(WARNING, "dynamic shared memory leak: segment %u stillreferenced", dsm_segment_handle(seg));} + +/* + * Make sure there is room for at least one more entry in a ResourceOwner's + * wait event set reference array. + * + * This is separate from actually inserting an entry because if we run out + * of memory, it's critical to do so *before* acquiring the resource. + */ +void +ResourceOwnerEnlargeWESs(ResourceOwner owner) +{ + ResourceArrayEnlarge(&(owner->wesarr)); +} + +/* + * Remember that a wait event set is owned by a ResourceOwner + * + * Caller must have previously done ResourceOwnerEnlargeWESs() + */ +void +ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events) +{ + ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events)); +} + +/* + * Forget that a wait event set is owned by a ResourceOwner + */ +void +ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events) +{ + /* + * XXXX: There's no property to show as an identier of a wait event set, + * use its pointer instead. 
+ */ + if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events))) + elog(ERROR, "wait event set %p is not owned by resource owner %s", + events, owner->name); +} + +/* + * Debugging subroutine + */ +static void +PrintWESLeakWarning(WaitEventSet *events) +{ + /* + * XXXX: There's no property to show as an identier of a wait event set, + * use its pointer instead. + */ + elog(WARNING, "wait event set leak: %p still referenced", + events); +} diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h index a43193c..997ee8d 100644 --- a/src/include/storage/latch.h +++ b/src/include/storage/latch.h @@ -101,6 +101,7 @@#define LATCH_H#include <signal.h> +#include "utils/resowner.h"/* * Latch structure should be treated as opaque and only accessed through @@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch);extern void SetLatch(volatile Latch *latch);extern voidResetLatch(volatile Latch *latch); -extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents); +extern WaitEventSet *CreateWaitEventSet(MemoryContext context, + ResourceOwner res, int nevents);extern void FreeWaitEventSet(WaitEventSet *set);externint AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch, void *user_data); diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h index 2420b65..70b0bb9 100644 --- a/src/include/utils/resowner_private.h +++ b/src/include/utils/resowner_private.h @@ -18,6 +18,7 @@#include "storage/dsm.h"#include "storage/fd.h" +#include "storage/latch.h"#include "storage/lock.h"#include "utils/catcache.h"#include "utils/plancache.h" @@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,extern void ResourceOwnerForgetDSM(ResourceOwnerowner, dsm_segment *); +/* support for wait event set management */ +extern void ResourceOwnerEnlargeWESs(ResourceOwner owner); +extern void ResourceOwnerRememberWES(ResourceOwner owner, + WaitEventSet *); +extern void ResourceOwnerForgetWES(ResourceOwner owner, + WaitEventSet *); +#endif /* RESOWNER_PRIVATE_H */ -- 2.9.2 From 1b213d238c398dc77cb31cf2a92284c70d292e9e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 19 Oct 2017 17:23:51 +0900 Subject: [PATCH 2/3] core side modification ---src/backend/executor/Makefile | 2 +-src/backend/executor/execAsync.c | 110 ++++++++++++++++++src/backend/executor/nodeAppend.c | 194 ++++++++++++++++++++++++++++++--src/backend/executor/nodeForeignscan.c | 22 +++-src/backend/optimizer/plan/createplan.c| 56 ++++++++-src/backend/postmaster/pgstat.c | 3 +src/include/executor/execAsync.h | 23 ++++src/include/executor/executor.h | 1 +src/include/executor/nodeForeignscan.h | 3 +src/include/foreign/fdwapi.h | 11 ++src/include/nodes/execnodes.h | 18 ++-src/include/nodes/plannodes.h | 2 +src/include/pgstat.h | 3 +-13 files changed, 428 insertions(+), 20 deletions(-)create mode 100644 src/backend/executor/execAsync.ccreatemode 100644 src/include/executor/execAsync.h diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile index 083b20f..21f5ad0 100644 --- a/src/backend/executor/Makefile +++ b/src/backend/executor/Makefile @@ -12,7 +12,7 @@ subdir = src/backend/executortop_builddir = ../../..include $(top_builddir)/src/Makefile.global -OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \ +OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \ execGrouping.o execIndexing.o execJunk.o\ execMain.o execParallel.o execProcnode.o \ execReplication.o 
execScan.o execSRF.o execTuples.o \ diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c new file mode 100644 index 0000000..f7daed7 --- /dev/null +++ b/src/backend/executor/execAsync.c @@ -0,0 +1,110 @@ +/*------------------------------------------------------------------------- + * + * execAsync.c + * Support routines for asynchronous execution. + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/executor/execAsync.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "executor/execAsync.h" +#include "executor/nodeAppend.h" +#include "executor/nodeForeignscan.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "utils/memutils.h" +#include "utils/resowner.h" + +void ExecAsyncSetState(PlanState *pstate, AsyncState status) +{ + pstate->asyncstate = status; +} + +bool +ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node, + void *data, bool reinit) +{ + switch (nodeTag(node)) + { + case T_ForeignScanState: + return ExecForeignAsyncConfigureWait((ForeignScanState *)node, + wes, data, reinit); + break; + default: + elog(ERROR, "unrecognized node type: %d", + (int) nodeTag(node)); + } +} + +#define EVENT_BUFFER_SIZE 16 + +Bitmapset * +ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout) +{ + static int *refind = NULL; + static int refindsize = 0; + WaitEventSet *wes; + WaitEvent occurred_event[EVENT_BUFFER_SIZE]; + int noccurred = 0; + Bitmapset *fired_events = NULL; + int i; + int n; + + n = bms_num_members(waitnodes); + wes = CreateWaitEventSet(TopTransactionContext, + TopTransactionResourceOwner, n); + if (refindsize < n) + { + if (refindsize == 0) + refindsize = EVENT_BUFFER_SIZE; /* XXX */ + while (refindsize < n) + refindsize *= 2; + if (refind) + refind = (int *) repalloc(refind, refindsize * sizeof(int)); + else + refind = (int *) palloc(refindsize * sizeof(int)); + } + + n = 0; + for (i = bms_next_member(waitnodes, -1) ; i >= 0 ; + i = bms_next_member(waitnodes, i)) + { + refind[i] = i; + if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true)) + n++; + } + + if (n == 0) + { + FreeWaitEventSet(wes); + return NULL; + } + + noccurred = WaitEventSetWait(wes, timeout, occurred_event, + EVENT_BUFFER_SIZE, + WAIT_EVENT_ASYNC_WAIT); + FreeWaitEventSet(wes); + if (noccurred == 0) + return NULL; + + for (i = 0 ; i < noccurred ; i++) + { + WaitEvent *w = &occurred_event[i]; + + if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0) + { + int n = *(int*)w->user_data; + + fired_events = bms_add_member(fired_events, n); + } + } + + return fired_events; +} diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c index bed9bb8..5355bb2 100644 --- a/src/backend/executor/nodeAppend.c +++ b/src/backend/executor/nodeAppend.c @@ -59,9 +59,11 @@#include "executor/execdebug.h"#include "executor/nodeAppend.h" +#include "executor/execAsync.h"#include "miscadmin.h"static TupleTableSlot *ExecAppend(PlanState *pstate); +static TupleTableSlot *ExecAppendAsync(PlanState *pstate);static bool exec_append_initialize_next(AppendState *appendstate); @@ -81,16 +83,16 @@ exec_append_initialize_next(AppendState *appendstate) /* * get information from the append node */ - whichplan = appendstate->as_whichplan; + whichplan = appendstate->as_whichsyncplan; - if (whichplan < 0) + if (whichplan < 
appendstate->as_nasyncplans) { /* * if scanning in reverse, we start at the last scanin the list and * then proceed back to the first.. in any case we inform ExecAppend * that we are atthe end of the line by returning FALSE */ - appendstate->as_whichplan = 0; + appendstate->as_whichsyncplan = appendstate->as_nasyncplans; return FALSE; } else if (whichplan >=appendstate->as_nplans) @@ -98,7 +100,7 @@ exec_append_initialize_next(AppendState *appendstate) /* * as above, end the scan if wego beyond the last scan in our list.. */ - appendstate->as_whichplan = appendstate->as_nplans - 1; + appendstate->as_whichsyncplan = appendstate->as_nplans - 1; return FALSE; } else @@ -128,7 +130,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags) ListCell *lc; /* check for unsupportedflags */ - Assert(!(eflags & EXEC_FLAG_MARK)); + Assert(!(eflags & EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)); /* * Lock the non-leaf tables in the partition tree controlledby this node. @@ -151,6 +153,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags) appendstate->ps.ExecProcNode = ExecAppend; appendstate->appendplans = appendplanstates; appendstate->as_nplans = nplans; + appendstate->as_nasyncplans = node->nasyncplans; + appendstate->as_syncdone = (node->nasyncplans == nplans); + appendstate->as_asyncresult = (TupleTableSlot **) + palloc0(node->nasyncplans * sizeof(TupleTableSlot *)); + + /* Choose async version of Exec function */ + if (appendstate->as_nasyncplans > 0) + appendstate->ps.ExecProcNode = ExecAppendAsync; + + /* initially, all async requests need a request */ + for (i = 0; i < appendstate->as_nasyncplans; ++i) + appendstate->as_needrequest = + bms_add_member(appendstate->as_needrequest, i); /* * Miscellaneous initialization @@ -173,11 +188,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags) foreach(lc, node->appendplans) { Plan *initNode = (Plan *) lfirst(lc); + int sub_eflags = eflags; + + if (i < appendstate->as_nasyncplans) + sub_eflags |= EXEC_FLAG_ASYNC; - appendplanstates[i] = ExecInitNode(initNode, estate, eflags); + appendplanstates[i] = ExecInitNode(initNode, estate, sub_eflags); i++; } + /* if there's any async-capable subnode, use async-aware routine */ + if (appendstate->as_nasyncplans) + appendstate->ps.ExecProcNode = ExecAppendAsync; + /* * initialize output tuple type */ @@ -187,7 +210,10 @@ ExecInitAppend(Append *node, EState *estate, int eflags) /* * initialize to scan first subplan */ - appendstate->as_whichplan = 0; + /* + * initialize to scan first synchronous subplan + */ + appendstate->as_whichsyncplan = appendstate->as_nasyncplans; exec_append_initialize_next(appendstate); returnappendstate; @@ -204,6 +230,8 @@ ExecAppend(PlanState *pstate){ AppendState *node = castNode(AppendState, pstate); + Assert(node->as_nasyncplans == 0); + for (;;) { PlanState *subnode; @@ -214,7 +242,7 @@ ExecAppend(PlanState *pstate) /* * figure out which subplan we are currently processing */ - subnode = node->appendplans[node->as_whichplan]; + subnode = node->appendplans[node->as_whichsyncplan]; /* * get a tuple from the subplan @@ -237,9 +265,9 @@ ExecAppend(PlanState *pstate) * ExecInitAppend. 
*/ if (ScanDirectionIsForward(node->ps.state->es_direction)) - node->as_whichplan++; + node->as_whichsyncplan++; else - node->as_whichplan--; + node->as_whichsyncplan--; if (!exec_append_initialize_next(node)) return ExecClearTuple(node->ps.ps_ResultTupleSlot); @@ -247,6 +275,141 @@ ExecAppend(PlanState *pstate) }} +static TupleTableSlot * +ExecAppendAsync(PlanState *pstate) +{ + AppendState *node = castNode(AppendState, pstate); + Bitmapset *needrequest; + int i; + + Assert(node->as_nasyncplans > 0); + + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + needrequest = node->as_needrequest; + node->as_needrequest = NULL; + while ((i = bms_first_member(needrequest)) >= 0) + { + TupleTableSlot *slot; + PlanState *subnode = node->appendplans[i]; + + slot = ExecProcNode(subnode); + if (subnode->asyncstate == AS_AVAILABLE) + { + if (!TupIsNull(slot)) + { + node->as_asyncresult[node->as_nasyncresult++] = slot; + node->as_needrequest = bms_add_member(node->as_needrequest, i); + } + } + else + node->as_pending_async = bms_add_member(node->as_pending_async, i); + } + bms_free(needrequest); + + for (;;) + { + TupleTableSlot *result; + + /* return now if a result is available */ + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + while (!bms_is_empty(node->as_pending_async)) + { + long timeout = node->as_syncdone ? -1 : 0; + Bitmapset *fired; + int i; + + fired = ExecAsyncEventWait(node->appendplans, node->as_pending_async, + timeout); + while ((i = bms_first_member(fired)) >= 0) + { + TupleTableSlot *slot; + PlanState *subnode = node->appendplans[i]; + slot = ExecProcNode(subnode); + if (subnode->asyncstate == AS_AVAILABLE) + { + if (!TupIsNull(slot)) + { + node->as_asyncresult[node->as_nasyncresult++] = slot; + node->as_needrequest = + bms_add_member(node->as_needrequest, i); + } + node->as_pending_async = + bms_del_member(node->as_pending_async, i); + } + } + bms_free(fired); + + /* return now if a result is available */ + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + if (!node->as_syncdone) + break; + } + + /* + * If there is no asynchronous activity still pending and the + * synchronous activity is also complete, we're totally done scanning + * this node. Otherwise, we're done with the asynchronous stuff but + * must continue scanning the synchronous children. + */ + if (node->as_syncdone) + { + Assert(bms_is_empty(node->as_pending_async)); + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + + /* + * get a tuple from the subplan + */ + result = ExecProcNode(node->appendplans[node->as_whichsyncplan]); + + if (!TupIsNull(result)) + { + /* + * If the subplan gave us something then return it as-is. We do + * NOT make use of the result slot that was set up in + * ExecInitAppend; there's no need for it. + */ + return result; + } + + /* + * Go on to the "next" subplan in the appropriate direction. If no + * more subplans, return the empty slot set up for us by + * ExecInitAppend, unless there are async plans we have yet to finish. 
+ */ + if (ScanDirectionIsForward(node->ps.state->es_direction)) + node->as_whichsyncplan++; + else + node->as_whichsyncplan--; + if (!exec_append_initialize_next(node)) + { + node->as_syncdone = true; + if (bms_is_empty(node->as_pending_async)) + { + Assert(bms_is_empty(node->as_needrequest)); + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + } + + /* Else loop back and try to get a tuple from the new subplan */ + } +} +/* ---------------------------------------------------------------- * ExecEndAppend * @@ -280,6 +443,15 @@ ExecReScanAppend(AppendState *node){ int i; + /* Reset async state. */ + for (i = 0; i < node->as_nasyncplans; ++i) + { + ExecShutdownNode(node->appendplans[i]); + node->as_needrequest = bms_add_member(node->as_needrequest, i); + } + node->as_nasyncresult = 0; + node->as_syncdone = (node->as_nasyncplans == node->as_nplans); + for (i = 0; i < node->as_nplans; i++) { PlanState *subnode = node->appendplans[i]; @@ -298,6 +470,6 @@ ExecReScanAppend(AppendState *node) if (subnode->chgParam == NULL) ExecReScan(subnode); } - node->as_whichplan = 0; + node->as_whichsyncplan = node->as_nasyncplans; exec_append_initialize_next(node);} diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c index 20892d6..e851988 100644 --- a/src/backend/executor/nodeForeignscan.c +++ b/src/backend/executor/nodeForeignscan.c @@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate) (ExecScanRecheckMtd) ForeignRecheck);} -/* ---------------------------------------------------------------- * ExecInitForeignScan * ---------------------------------------------------------------- @@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags) scanstate->ss.ps.plan = (Plan*) node; scanstate->ss.ps.state = estate; scanstate->ss.ps.ExecProcNode = ExecForeignScan; + scanstate->ss.ps.asyncstate = AS_AVAILABLE; + + if ((eflags & EXEC_FLAG_ASYNC) != 0) + scanstate->fs_async = true; /* * Miscellaneous initialization @@ -388,3 +391,20 @@ ExecShutdownForeignScan(ForeignScanState *node) if (fdwroutine->ShutdownForeignScan) fdwroutine->ShutdownForeignScan(node);} + +/* ---------------------------------------------------------------- + * ExecAsyncForeignScanConfigureWait + * + * In async mode, configure for a wait + * ---------------------------------------------------------------- + */ +bool +ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes, + void *caller_data, bool reinit) +{ + FdwRoutine *fdwroutine = node->fdwroutine; + + Assert(fdwroutine->ForeignAsyncConfigureWait != NULL); + return fdwroutine->ForeignAsyncConfigureWait(node, wes, + caller_data, reinit); +} diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index 792ea84..53eb56d 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -203,7 +203,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual Index scanrelid, char *enrname);static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual, Index scanrelid, int wtParam); -static Append *make_append(List *appendplans, List *tlist, List *partitioned_rels); +static Append *make_append(List *appendplans, int nasyncplans, int referent, + List *tlist, List *partitioned_rels);static RecursiveUnion *make_recursive_union(List *tlist, Plan *lefttree, Plan *righttree, @@ -283,6 +284,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root, List *rowMarks, OnConflictExpr*onconflict, int 
epqParam);static GatherMerge *create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path); +static bool is_async_capable_path(Path *path);/* @@ -1004,8 +1006,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path){ Append *plan; List *tlist = build_path_tlist(root, &best_path->path); - List *subplans = NIL; + List *asyncplans = NIL; + List *syncplans = NIL; ListCell *subpaths; + int nasyncplans = 0; + bool first = true; + bool referent_is_sync = true; /* * The subpaths list could be empty, if every child was proven empty by @@ -1040,7 +1046,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path) /* Must insist that all childrenreturn the same tlist */ subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST); - subplans = lappend(subplans, subplan); + /* Classify as async-capable or not */ + if (is_async_capable_path(subpath)) + { + asyncplans = lappend(asyncplans, subplan); + ++nasyncplans; + if (first) + referent_is_sync = false; + } + else + syncplans = lappend(syncplans, subplan); + + first = false; } /* @@ -1050,7 +1067,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path) * parent-rel Vars it'll be asked toemit. */ - plan = make_append(subplans, tlist, best_path->partitioned_rels); + plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, + referent_is_sync ? nasyncplans : 0, tlist, + best_path->partitioned_rels); copy_generic_path_info(&plan->plan, (Path *) best_path); @@ -5281,7 +5300,8 @@ make_foreignscan(List *qptlist,}static Append * -make_append(List *appendplans, List *tlist, List *partitioned_rels) +make_append(List *appendplans, int nasyncplans, int referent, + List *tlist, List *partitioned_rels){ Append *node = makeNode(Append); Plan *plan = &node->plan; @@ -5292,6 +5312,8 @@ make_append(List *appendplans, List *tlist, List *partitioned_rels) plan->righttree = NULL; node->partitioned_rels= partitioned_rels; node->appendplans = appendplans; + node->nasyncplans = nasyncplans; + node->referent = referent; return node;} @@ -6628,3 +6650,27 @@ is_projection_capable_plan(Plan *plan) } return true;} + +/* + * is_projection_capable_path + * Check whether a given Path node is async-capable. 
+ */ +static bool +is_async_capable_path(Path *path) +{ + switch (nodeTag(path)) + { + case T_ForeignPath: + { + FdwRoutine *fdwroutine = path->parent->fdwroutine; + + Assert(fdwroutine != NULL); + if (fdwroutine->IsForeignPathAsyncCapable != NULL && + fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path)) + return true; + } + default: + break; + } + return false; +} diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 3a0b49c..4c6571e 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -3628,6 +3628,9 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_SYNC_REP: event_name = "SyncRep"; break; + case WAIT_EVENT_ASYNC_WAIT: + event_name = "AsyncExecWait"; + break; /* no default case, so that compiler will warn */ } diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h new file mode 100644 index 0000000..5fd67d9 --- /dev/null +++ b/src/include/executor/execAsync.h @@ -0,0 +1,23 @@ +/*-------------------------------------------------------------------- + * execAsync.c + * Support functions for asynchronous query execution + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/executor/execAsync.c + *-------------------------------------------------------------------- + */ +#ifndef EXECASYNC_H +#define EXECASYNC_H + +#include "nodes/execnodes.h" +#include "storage/latch.h" + +extern void ExecAsyncSetState(PlanState *pstate, AsyncState status); +extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node, + void *data, bool reinit); +extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, + long timeout); +#endif /* EXECASYNC_H */ diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index 37fd6b2..2ab9d72 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -63,6 +63,7 @@#define EXEC_FLAG_WITH_OIDS 0x0020 /* force OIDs in returned tuples */#define EXEC_FLAG_WITHOUT_OIDS 0x0040 /* force no OIDs in returned tuples */#define EXEC_FLAG_WITH_NO_DATA 0x0080 /* relscannability doesn't matter */ +#define EXEC_FLAG_ASYNC 0x0100 /* request async execution *//* Hook for plugins to get control in ExecutorStart()*/ diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h index 0354c2c..fed46d7 100644 --- a/src/include/executor/nodeForeignscan.h +++ b/src/include/executor/nodeForeignscan.h @@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,extern void ExecForeignScanInitializeWorker(ForeignScanState*node, shm_toc *toc);extern void ExecShutdownForeignScan(ForeignScanState*node); +extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, bool reinit);#endif /* NODEFOREIGNSCAN_H*/ diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h index 04e43cc..566236b 100644 --- a/src/include/foreign/fdwapi.h +++ b/src/include/foreign/fdwapi.h @@ -161,6 +161,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,typedef List *(*ReparameterizeForeignPathByChild_function)(PlannerInfo *root, List *fdw_private, RelOptInfo *child_rel); +typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path); +typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, + bool reinit);/* * 
FdwRoutine is the struct returned by a foreign-datawrapper's handler @@ -182,6 +187,7 @@ typedef struct FdwRoutine GetForeignPlan_function GetForeignPlan; BeginForeignScan_function BeginForeignScan; IterateForeignScan_function IterateForeignScan; + IterateForeignScan_function IterateForeignScanAsync; ReScanForeignScan_function ReScanForeignScan; EndForeignScan_functionEndForeignScan; @@ -232,6 +238,11 @@ typedef struct FdwRoutine InitializeDSMForeignScan_function InitializeDSMForeignScan; ReInitializeDSMForeignScan_functionReInitializeDSMForeignScan; InitializeWorkerForeignScan_function InitializeWorkerForeignScan; + + /* Support functions for asynchronous execution */ + IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable; + ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait; + ShutdownForeignScan_function ShutdownForeignScan; /* Support functions for path reparameterization. */ diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index c461134..7f663eb 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -840,6 +840,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate); * abstract superclass for allPlanState-type nodes. * ---------------- */ +typedef enum AsyncState +{ + AS_AVAILABLE, + AS_WAITING +} AsyncState; +typedef struct PlanState{ NodeTag type; @@ -880,6 +886,9 @@ typedef struct PlanState TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */ ExprContext*ps_ExprContext; /* node's expression-evaluation context */ ProjectionInfo *ps_ProjInfo; /* info fordoing tuple projection */ + + AsyncState asyncstate; + int32 padding; /* to keep alignment of derived types */} PlanState;/* ---------------- @@ -1003,7 +1012,13 @@ typedef struct AppendState PlanState ps; /* its first field is NodeTag */ PlanState **appendplans; /* array of PlanStates for my inputs */ int as_nplans; - int as_whichplan; + int as_nasyncplans; /* # of async-capable children */ + int as_whichsyncplan; /* which sync plan is being executed */ + bool as_syncdone; /* all synchronous plans done? 
*/ + Bitmapset *as_needrequest; /* async plans needing a new request */ + Bitmapset *as_pending_async; /* pending async plans */ + TupleTableSlot **as_asyncresult; /* unreturned results of async plans */ + int as_nasyncresult; /* # of valid entries in as_asyncresult */} AppendState;/* ---------------- @@ -1546,6 +1561,7 @@ typedef struct ForeignScanState Size pscan_len; /* size of parallel coordination information*/ /* use struct pointer to avoid including fdwapi.h here */ struct FdwRoutine *fdwroutine; + bool fs_async; void *fdw_state; /* foreign-data wrapper can keep state here */} ForeignScanState; diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h index a382331..e0eccc8 100644 --- a/src/include/nodes/plannodes.h +++ b/src/include/nodes/plannodes.h @@ -248,6 +248,8 @@ typedef struct Append /* RT indexes of non-leaf tables in a partition tree */ List *partitioned_rels; List *appendplans; + int nasyncplans; /* # of async plans, always at start of list */ + int referent; /* index of inheritance tree referent */} Append;/* ---------------- diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 089b7c3..fe9d39c 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -816,7 +816,8 @@ typedef enum WAIT_EVENT_REPLICATION_ORIGIN_DROP, WAIT_EVENT_REPLICATION_SLOT_DROP, WAIT_EVENT_SAFE_SNAPSHOT, - WAIT_EVENT_SYNC_REP + WAIT_EVENT_SYNC_REP, + WAIT_EVENT_ASYNC_WAIT} WaitEventIPC;/* ---------- -- 2.9.2 From 9f6a16ef7f7d1a38353216191641deb0d3ea58e7 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 19 Oct 2017 17:24:07 +0900 Subject: [PATCH 3/3] async postgres_fdw ---contrib/postgres_fdw/connection.c | 26 ++contrib/postgres_fdw/expected/postgres_fdw.out | 128 ++++---contrib/postgres_fdw/postgres_fdw.c | 484 +++++++++++++++++++++----contrib/postgres_fdw/postgres_fdw.h | 2 +contrib/postgres_fdw/sql/postgres_fdw.sql | 20 +-5 files changed, 522 insertions(+), 138 deletions(-) diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index be4ec07..00301d0 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -58,6 +58,7 @@ typedef struct ConnCacheEntry bool invalidated; /* true if reconnect is pending */ uint32 server_hashvalue; /* hash value of foreign server OID */ uint32 mapping_hashvalue; /* hash valueof user mapping OID */ + void *storage; /* connection specific storage */} ConnCacheEntry;/* @@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt) elog(DEBUG3, "new postgres_fdw connection%p for server \"%s\" (user mapping oid %u, userid %u)", entry->conn, server->servername, user->umid,user->userid); + entry->storage = NULL; } /* @@ -216,6 +218,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt)}/* + * Rerturns the connection specific storage for this user. Allocate with + * initsize if not exists. + */ +void * +GetConnectionSpecificStorage(UserMapping *user, size_t initsize) +{ + bool found; + ConnCacheEntry *entry; + ConnCacheKey key; + + key = user->umid; + entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found); + Assert(found); + + if (entry->storage == NULL) + { + entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize); + memset(entry->storage, 0, initsize); + } + + return entry->storage; +} + +/* * Connect to remote server using specified server and user mapping properties. 
*/static PGconn * diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 4339bbf..2a0a662 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -6512,7 +6512,7 @@ INSERT INTO a(aa) VALUES('aaaaa');INSERT INTO b(aa) VALUES('bbb');INSERT INTO b(aa) VALUES('bbbb');INSERTINTO b(aa) VALUES('bbbbb'); -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+------- a | aaa @@ -6540,7 +6540,7 @@ SELECT tableoid::regclass, * FROM ONLY a;(3 rows)UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | aaa @@ -6568,7 +6568,7 @@ SELECT tableoid::regclass, * FROM ONLY a;(3 rows)UPDATE b SET aa = 'new'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | aaa @@ -6596,7 +6596,7 @@ SELECT tableoid::regclass, * FROM ONLY a;(3 rows)UPDATE a SET aa = 'newtoo'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | newtoo @@ -6662,35 +6662,40 @@ insert into bar2 values(3,33,33);insert into bar2 values(4,44,44);insert into bar2 values(7,77,77);explain(verbose, costs off) -select * from bar where f1 in (select f1 from foo) for update; - QUERY PLAN ----------------------------------------------------------------------------------------------- +select * from bar where f1 in (select f1 from foo) order by 1 for update; + QUERY PLAN +----------------------------------------------------------------------------------------------------------------- LockRows Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid - -> Hash Join + -> Merge Join Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid InnerUnique: true - Hash Cond: (bar.f1 = foo.f1) - -> Append - -> Seq Scan on public.bar + Merge Cond: (bar.f1 = foo.f1) + -> Merge Append + Sort Key: bar.f1 + -> Sort Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid + Sort Key: bar.f1 + -> Seq Scan on public.bar + Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid -> Foreign Scan on public.bar2 Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid - Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE - -> Hash + Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE + -> Sort Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Sort Key: foo.f1 -> HashAggregate Output: foo.ctid, foo.*, foo.tableoid,foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scanon public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -(23 rows) + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 +(28 rows) -select * from bar where f1 in (select f1 from foo) for update; +select * from bar where f1 in (select f1 from foo) order by 1 for update; f1 | f2 ----+---- 1 | 11 @@ -6700,35 +6705,40 @@ select * from bar where f1 in (select f1 from foo) for update;(4 rows)explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for share; - QUERY PLAN ----------------------------------------------------------------------------------------------- +select * from bar where f1 in 
(select f1 from foo) order by 1 for share; + QUERY PLAN +---------------------------------------------------------------------------------------------------------------- LockRows Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid - -> Hash Join + -> Merge Join Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid InnerUnique: true - Hash Cond: (bar.f1 = foo.f1) - -> Append - -> Seq Scan on public.bar + Merge Cond: (bar.f1 = foo.f1) + -> Merge Append + Sort Key: bar.f1 + -> Sort Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid + Sort Key: bar.f1 + -> Seq Scan on public.bar + Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid -> Foreign Scan on public.bar2 Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid - Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE - -> Hash + Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE + -> Sort Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Sort Key: foo.f1 -> HashAggregate Output: foo.ctid, foo.*, foo.tableoid,foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scanon public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -(23 rows) + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 +(28 rows) -select * from bar where f1 in (select f1 from foo) for share; +select * from bar where f1 in (select f1 from foo) order by 1 for share; f1 | f2 ----+---- 1 | 11 @@ -6758,11 +6768,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: foo.ctid,foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scanon public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Hash Join Output: bar2.f1,(bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid Inner Unique: true @@ -6776,11 +6786,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: foo.ctid,foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scanon public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1(39 rows)update bar set f2 = f2 + 100 wheref1 in (select f1 from foo); @@ -6811,16 +6821,16 @@ where bar.f1 = ss.f1; Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1)) HashCond: (foo.f1 = bar.f1) -> Append - -> Seq Scan on public.foo - Output: ROW(foo.f1), foo.f1 -> Foreign Scan on public.foo2 Output:ROW(foo2.f1), foo2.f1 Remote SQL: SELECT f1 FROM public.loct1 - -> Seq Scan on public.foo foo_1 - Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) -> Foreign Scan on public.foo2 foo2_1 Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) Remote SQL: SELECT f1 FROM public.loct1 + -> Seq Scan on public.foo + Output: ROW(foo.f1), foo.f1 + -> Seq Scan on public.foo foo_1 + Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) -> Hash Output: bar.f1, bar.f2,bar.ctid -> Seq Scan on public.bar @@ -6838,16 +6848,16 @@ where bar.f1 = ss.f1; Output: (ROW(foo.f1)), foo.f1 Sort Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: ROW(foo.f1), foo.f1 -> Foreign 
Scan on public.foo2 Output: ROW(foo2.f1), foo2.f1 Remote SQL: SELECT f1 FROM public.loct1 - -> Seq Scan on public.foo foo_1 - Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) -> Foreign Scan on public.foo2foo2_1 Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) RemoteSQL: SELECT f1 FROM public.loct1 + -> Seq Scan on public.foo + Output: ROW(foo.f1), foo.f1 + -> Seq Scan on public.foo foo_1 + Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)(45 rows)update bar set f2 = f2 + 100 @@ -6998,27 +7008,33 @@ delete from foo where f1 < 5 returning *;(5 rows)explain (verbose, costs off) -update bar set f2 = f2 + 100 returning *; - QUERY PLAN ------------------------------------------------------------------------------- - Update on public.bar - Output: bar.f1, bar.f2 - Update on public.bar - Foreign Update on public.bar2 - -> Seq Scan on public.bar - Output: bar.f1, (bar.f2 + 100), bar.ctid - -> Foreign Update on public.bar2 - Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2 -(8 rows) +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; + QUERY PLAN +-------------------------------------------------------------------------------------- + Sort + Output: u.f1, u.f2 + Sort Key: u.f1 + CTE u + -> Update on public.bar + Output: bar.f1, bar.f2 + Update on public.bar + Foreign Update on public.bar2 + -> Seq Scan on public.bar + Output: bar.f1, (bar.f2 + 100), bar.ctid + -> Foreign Update on public.bar2 + Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2 + -> CTE Scan on u + Output: u.f1, u.f2 +(14 rows) -update bar set f2 = f2 + 100 returning *; +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; f1 | f2 ----+----- 1 | 311 2 | 322 - 6 | 266 3 | 333 4 | 344 + 6 | 266 7 | 277(6 rows) diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index fb65e2e..0688504 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -20,6 +20,8 @@#include "commands/defrem.h"#include "commands/explain.h"#include "commands/vacuum.h" +#include "executor/execAsync.h" +#include "executor/nodeForeignscan.h"#include "foreign/fdwapi.h"#include "funcapi.h"#include "miscadmin.h" @@ -34,6 +36,7 @@#include "optimizer/var.h"#include "optimizer/tlist.h"#include "parser/parsetree.h" +#include "pgstat.h"#include "utils/builtins.h"#include "utils/guc.h"#include "utils/lsyscache.h" @@ -53,6 +56,9 @@ PG_MODULE_MAGIC;/* If no remote estimates, assume a sort costs 20% extra */#define DEFAULT_FDW_SORT_MULTIPLIER1.2 +/* Retrive PgFdwScanState struct from ForeginScanState */ +#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state) +/* * Indexes of FDW-private information stored in fdw_private lists. * @@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex};/* + * Connection private area structure. + */ +typedef struct PgFdwConnpriv +{ + ForeignScanState *current_owner; /* The node currently running a query + * on this connection*/ +} PgFdwConnpriv; + +/* Execution state base type */ +typedef struct PgFdwState +{ + PGconn *conn; /* connection for the scan */ + PgFdwConnpriv *connpriv; /* connection private memory */ +} PgFdwState; + +/* * Execution state of a foreign scan using postgres_fdw. */typedef struct PgFdwScanState{ + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreigntable. NULL * for a foreign join scan. 
*/ TupleDesc tupdesc; /* tupledescriptor of scan */ @@ -134,7 +157,7 @@ typedef struct PgFdwScanState List *retrieved_attrs; /* list of retrieved attribute numbers*/ /* for remote query execution */ - PGconn *conn; /* connection for the scan */ + bool result_ready; unsigned int cursor_number; /* quasi-unique ID for my cursor */ bool cursor_exists; /* have we created the cursor? */ int numParams; /* number of parameters passed toquery */ @@ -150,6 +173,13 @@ typedef struct PgFdwScanState /* batch-level state, for optimizing rewinds and avoiding useless fetch*/ int fetch_ct_2; /* Min(# of fetches done, 2) */ bool eof_reached; /* true if lastfetch reached EOF */ + bool run_async; /* true if run asynchronously */ + bool async_waiting; /* true if requesting the parent to wait */ + ForeignScanState *waiter; /* Next node to run a query among nodes + * sharing the same connection */ + ForeignScanState *last_waiter; /* A waiting node at the end of a waiting + * list. Maintained only by the current + * owner of the connection */ /* working memory contexts */ MemoryContext batch_cxt; /* context holding current batch of tuples */ @@ -163,11 +193,11 @@ typedef struct PgFdwScanState */typedef struct PgFdwModifyState{ + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreigntable */ AttInMetadata *attinmeta; /* attribute datatype conversion metadata */ /* for remote query execution*/ - PGconn *conn; /* connection for the scan */ char *p_name; /* name of prepared statement,if created */ /* extracted fdw_private data */ @@ -190,6 +220,7 @@ typedef struct PgFdwModifyState */typedef struct PgFdwDirectModifyState{ + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreigntable */ AttInMetadata *attinmeta; /* attribute datatype conversion metadata */ @@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);static TupleTableSlot *postgresIterateForeignScan(ForeignScanState*node);static void postgresReScanForeignScan(ForeignScanState *node);static voidpostgresEndForeignScan(ForeignScanState *node); +static void postgresShutdownForeignScan(ForeignScanState *node);static void postgresAddForeignUpdateTargets(Query *parsetree, RangeTblEntry *target_rte, Relation target_relation); @@ -348,6 +380,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root, UpperRelationKindstage, RelOptInfo *input_rel, RelOptInfo *output_rel); +static bool postgresIsForeignPathAsyncCapable(ForeignPath *path); +static bool postgresForeignAsyncConfigureWait(ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, bool reinit);/* * Helper functions @@ -368,7 +404,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel, EquivalenceClass*ec, EquivalenceMember *em, void *arg);static void create_cursor(ForeignScanState*node); -static void fetch_more_data(ForeignScanState *node); +static void request_more_data(ForeignScanState *node); +static void fetch_received_data(ForeignScanState *node); +static void vacate_connection(PgFdwState *fdwconn); +static void absorb_current_result(ForeignScanState *node);static void close_cursor(PGconn *conn, unsigned int cursor_number);staticvoid prepare_foreign_modify(PgFdwModifyState *fmstate);static const char **convert_prep_stmt_params(PgFdwModifyState*fmstate, @@ -438,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS) routine->IterateForeignScan = postgresIterateForeignScan; routine->ReScanForeignScan = postgresReScanForeignScan; routine->EndForeignScan = postgresEndForeignScan; 
+ routine->ShutdownForeignScan = postgresShutdownForeignScan; /* Functions for updating foreign tables */ routine->AddForeignUpdateTargets= postgresAddForeignUpdateTargets; @@ -472,6 +512,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS) /* Support functions for upper relation push-down */ routine->GetForeignUpperPaths= postgresGetForeignUpperPaths; + /* Support functions for async execution */ + routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable; + routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait; + PG_RETURN_POINTER(routine);} @@ -1322,12 +1366,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags) * Get connection to the foreignserver. Connection manager will * establish new connection if necessary. */ - fsstate->conn = GetConnection(user, false); + fsstate->s.conn = GetConnection(user, false); + fsstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); + fsstate->s.connpriv->current_owner = NULL; + fsstate->waiter = NULL; + fsstate->last_waiter = node; /* Assign a unique ID for my cursor */ - fsstate->cursor_number = GetCursorNumber(fsstate->conn); + fsstate->cursor_number = GetCursorNumber(fsstate->s.conn); fsstate->cursor_exists = false; + /* Initialize async execution status */ + fsstate->run_async = false; + fsstate->async_waiting = false; + /* Get private info created by planner functions. */ fsstate->query = strVal(list_nth(fsplan->fdw_private, FdwScanPrivateSelectSql)); @@ -1383,32 +1436,136 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)static TupleTableSlot *postgresIterateForeignScan(ForeignScanState*node){ - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); TupleTableSlot *slot = node->ss.ss_ScanTupleSlot; /* - * If this is the first call after Begin or ReScan, we need to create the - * cursor on the remote side. - */ - if (!fsstate->cursor_exists) - create_cursor(node); - - /* * Get some more tuples, if we've run out. */ if (fsstate->next_tuple >= fsstate->num_tuples) { - /* No point in another fetch if we already detected EOF, though. */ - if (!fsstate->eof_reached) - fetch_more_data(node); - /* If we didn't get any tuples, must be end of data. */ + ForeignScanState *next_conn_owner = node; + + /* This node has sent a query on this connection */ + if (fsstate->s.connpriv->current_owner == node) + { + /* Check if the result is available */ + if (PQisBusy(fsstate->s.conn)) + { + int rc = WaitLatchOrSocket(NULL, + WL_SOCKET_READABLE | WL_TIMEOUT, + PQsocket(fsstate->s.conn), 0, + WAIT_EVENT_ASYNC_WAIT); + if (node->fs_async && !(rc & WL_SOCKET_READABLE)) + { + /* + * This node is not ready yet. Tell the caller to wait. + */ + fsstate->result_ready = false; + node->ss.ps.asyncstate = AS_WAITING; + return ExecClearTuple(slot); + } + } + + Assert(fsstate->async_waiting); + fsstate->async_waiting = false; + fetch_received_data(node); + + /* + * If someone is waiting this node on the same connection, let the + * first waiter be the next owner of this connection. + */ + if (fsstate->waiter) + { + PgFdwScanState *next_owner_state; + + next_conn_owner = fsstate->waiter; + next_owner_state = GetPgFdwScanState(next_conn_owner); + fsstate->waiter = NULL; + + /* + * only the current owner is responsible to maintain the shortcut + * to the last waiter + */ + next_owner_state->last_waiter = fsstate->last_waiter; + + /* + * for simplicity, last_waiter points itself on a node that no one + * is waiting for. 
+ */ + fsstate->last_waiter = node; + } + } + else if (fsstate->s.connpriv->current_owner && + !GetPgFdwScanState(node)->eof_reached) + { + /* + * Anyone else is holding this connection and we want this node to + * run later. Add myself to the tail of the waiters' list then + * return not-ready. To avoid scanning through the waiters' list, + * the current owner is to maintain the shortcut to the last + * waiter. + */ + PgFdwScanState *conn_owner_state = + GetPgFdwScanState(fsstate->s.connpriv->current_owner); + ForeignScanState *last_waiter = conn_owner_state->last_waiter; + PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter); + + last_waiter_state->waiter = node; + conn_owner_state->last_waiter = node; + + /* Register the node to the async-waiting node list */ + Assert(!GetPgFdwScanState(node)->async_waiting); + + GetPgFdwScanState(node)->async_waiting = true; + + fsstate->result_ready = fsstate->eof_reached; + node->ss.ps.asyncstate = + fsstate->result_ready ? AS_AVAILABLE : AS_WAITING; + return ExecClearTuple(slot); + } + + /* At this time no node is running on the connection */ + Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner + == NULL); + /* + * Send the next request for the next owner of this connection if + * needed. + */ + if (!GetPgFdwScanState(next_conn_owner)->eof_reached) + { + PgFdwScanState *next_owner_state = + GetPgFdwScanState(next_conn_owner); + + request_more_data(next_conn_owner); + + /* Register the node to the async-waiting node list */ + if (!next_owner_state->async_waiting) + next_owner_state->async_waiting = true; + + if (!next_conn_owner->fs_async) + fetch_received_data(next_conn_owner); + } + + + /* + * If we haven't received a result for the given node this time, + * return with no tuple to give way to other nodes. + */ if (fsstate->next_tuple >= fsstate->num_tuples) + { + fsstate->result_ready = fsstate->eof_reached; + node->ss.ps.asyncstate = + fsstate->result_ready ? AS_AVAILABLE : AS_WAITING; return ExecClearTuple(slot); + } } /* * Return the next tuple. */ + fsstate->result_ready = true; + node->ss.ps.asyncstate = AS_AVAILABLE; ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++], slot, InvalidBuffer, @@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)static voidpostgresReScanForeignScan(ForeignScanState*node){ - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); char sql[64]; PGresult *res; @@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node) if (!fsstate->cursor_exists) return; + /* Absorb the ramining result */ + absorb_current_result(node); + /* * If any internal parameters affecting this node have changed, we'd * better destroy and recreate the cursor. Otherwise, rewinding it should @@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node) * We don't use a PG_TRY block here, so be carefulnot to throw error * without releasing the PGresult. */ - res = pgfdw_exec_query(fsstate->conn, sql); + res = pgfdw_exec_query(fsstate->s.conn, sql); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fsstate->conn, true, sql); + pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql); PQclear(res); /* Now force a fresh FETCH. 
*/ @@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)static voidpostgresEndForeignScan(ForeignScanState*node){ - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); /* if fsstate is NULL, we are in EXPLAIN; nothing to do */ if (fsstate == NULL) @@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node) /* Close the cursor if open, to prevent accumulationof cursors */ if (fsstate->cursor_exists) - close_cursor(fsstate->conn, fsstate->cursor_number); + close_cursor(fsstate->s.conn, fsstate->cursor_number); /* Release remote connection */ - ReleaseConnection(fsstate->conn); - fsstate->conn = NULL; + ReleaseConnection(fsstate->s.conn); + fsstate->s.conn = NULL; /* MemoryContexts will be deleted automatically. */}/* + * postgresShutdownForeignScan + * Remove asynchrony stuff and cleanup garbage on the connection. + */ +static void +postgresShutdownForeignScan(ForeignScanState *node) +{ + ForeignScan *plan = (ForeignScan *) node->ss.ps.plan; + + if (plan->operation != CMD_SELECT) + return; + + /* Absorb the ramining result */ + absorb_current_result(node); +} + +/* * postgresAddForeignUpdateTargets * Add resjunk column(s) needed for update/delete on a foreign table */ @@ -1700,7 +1876,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate, user = GetUserMapping(userid, table->serverid); /* Open connection; report that we'll create a prepared statement. */ - fmstate->conn = GetConnection(user, true); + fmstate->s.conn = GetConnection(user, true); + fmstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); fmstate->p_name = NULL; /* prepared statementnot made yet */ /* Deconstruct fdw_private data. */ @@ -1779,6 +1957,8 @@ postgresExecForeignInsert(EState *estate, PGresult *res; int n_rows; + vacate_connection((PgFdwState *)fmstate); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1789,14 +1969,14 @@ postgresExecForeignInsert(EState *estate, /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check forsuccess. @@ -1804,10 +1984,10 @@ postgresExecForeignInsert(EState *estate, * We don't use a PG_TRY block here, so be careful notto throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, andfetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -1845,6 +2025,8 @@ postgresExecForeignUpdate(EState *estate, PGresult *res; int n_rows; + vacate_connection((PgFdwState *)fmstate); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1865,14 +2047,14 @@ postgresExecForeignUpdate(EState *estate, /* * Execute the prepared statement. 
 */
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
 fmstate->p_name,
 fmstate->p_nums,
 p_values,
 NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);

 /*
 * Get the result, and check for success.
@@ -1880,10 +2062,10 @@ postgresExecForeignUpdate(EState *estate,
 * We don't use a PG_TRY block here, so be careful not to throw error
 * without releasing the PGresult.
 */
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 if (PQresultStatus(res) !=
 (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);

 /* Check number of rows affected, and fetch RETURNING tuple if any */
 if (fmstate->has_returning)
@@ -1921,6 +2103,8 @@ postgresExecForeignDelete(EState *estate,
 PGresult *res;
 int n_rows;

+ vacate_connection((PgFdwState *) fmstate);
+
 /* Set up the prepared statement on the remote server, if we didn't yet */
 if (!fmstate->p_name)
 prepare_foreign_modify(fmstate);
@@ -1941,14 +2125,14 @@ postgresExecForeignDelete(EState *estate,
 /*
 * Execute the prepared statement.
 */
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
 fmstate->p_name,
 fmstate->p_nums,
 p_values,
 NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);

 /*
 * Get the result, and check for success.
@@ -1956,10 +2140,10 @@ postgresExecForeignDelete(EState *estate,
 * We don't use a PG_TRY block here, so be careful not to throw error
 * without releasing the PGresult.
 */
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 if (PQresultStatus(res) !=
 (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);

 /* Check number of rows affected, and fetch RETURNING tuple if any */
 if (fmstate->has_returning)
@@ -2006,16 +2190,16 @@ postgresEndForeignModify(EState *estate,
 * We don't use a PG_TRY block here, so be careful not to throw error
 * without releasing the PGresult.
 */
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
 if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 PQclear(res);
 fmstate->p_name = NULL;
 }

 /* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}

/*
@@ -2303,7 +2487,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 * Get connection to the foreign server. Connection manager will
 * establish new connection if necessary.
 */
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));

 /* Initialize state variable */
 dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2356,7 +2542,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 * If this is the first call after Begin, execute the statement.
 */
 if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *) dmstate);
 execute_dml_stmt(node);
+ }

 /*
 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2403,8 +2592,8 @@ postgresEndDirectModify(ForeignScanState *node)
 PQclear(dmstate->result);

 /* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;

 /* MemoryContext will be deleted automatically. */
}

@@ -2523,6 +2712,7 @@ estimate_path_cost_size(PlannerInfo *root,
 List *local_param_join_conds;
 StringInfoData sql;
 PGconn *conn;
+ PgFdwConnpriv *connpriv;
 Selectivity local_sel;
 QualCost local_cost;
 List *fdw_scan_tlist = NIL;
@@ -2565,6 +2755,16 @@ estimate_path_cost_size(PlannerInfo *root,
 /* Get the remote estimate */
 conn = GetConnection(fpinfo->user, false);
+ connpriv = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnpriv));
+ if (connpriv)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connpriv = connpriv;
+ vacate_connection(&tmpstate);
+ }
+
 get_remote_estimate(sql.data, conn, &rows, &width,
 &startup_cost, &total_cost);
 ReleaseConnection(conn);
@@ -2919,11 +3119,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
 ExprContext *econtext = node->ss.ps.ps_ExprContext;
 int numParams = fsstate->numParams;
 const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
 StringInfoData buf;
 PGresult *res;
@@ -2989,47 +3189,96 @@ create_cursor(ForeignScanState *node)
 * Fetch some more rows from the node's cursor.
 */
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connpriv->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Fetch some more rows from the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
 PGresult *volatile res = NULL;
 MemoryContext oldcontext;

+ /* This node should be the current connection owner */
+ Assert(fsstate->s.connpriv->current_owner == node);
+
 /*
 * We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
 */
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while (fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
 oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);

 /* PGresult must be released before leaving this function. */
 PG_TRY();
 {
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
 char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
 int i;

 snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 fsstate->fetch_size, fsstate->cursor_number);

- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
 /* On error, report the original query, not the FETCH. */
 if (PQresultStatus(res) != PGRES_TUPLES_OK)
 pgfdw_report_error(ERROR, res, conn, false, fsstate->query);

 /* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);

- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
 {
 Assert(IsA(node->ss.ps.plan, ForeignScan));

- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
 make_tuple_from_result_row(res, i,
 fsstate->rel,
 fsstate->attinmeta,
@@ -3039,27 +3288,82 @@ fetch_more_data(ForeignScanState *node)
 }

 /* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 fsstate->fetch_ct_2++;

+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
 /* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);

 PQclear(res);
 res = NULL;
 }
 PG_CATCH();
 {
+ fsstate->s.connpriv->current_owner = NULL;
 if (res)
 PQclear(res);
 PG_RE_THROW();
 }
 PG_END_TRY();

+ fsstate->s.connpriv->current_owner = NULL;
+
 MemoryContextSwitchTo(oldcontext);
}

/*
+ * Vacate a connection so that this node can send the next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnpriv *connpriv = fdwstate->connpriv;
+ ForeignScanState *owner;
+
+ if (connpriv == NULL || connpriv->current_owner == NULL)
+ return;
+
+ /*
+ * Let the current connection owner read the result of the running
+ * query.
+ */
+ owner = connpriv->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+
+ while (PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+/*
 * Force assorted GUC parameters to settings that ensure that we'll output
 * data values in a form that is unambiguous to the remote server.
 *
@@ -3143,7 +3447,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 /* Construct name we'll use for the prepared statement. */
 snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
 p_name = pstrdup(prep_name);

 /*
@@ -3153,12 +3457,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 * the prepared statements we use in this module are simple enough that
 * the remote server will make the right choices.
 */
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
 p_name,
 fmstate->query,
 0,
 NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);

 /*
 * Get the result, and check for success.
@@ -3166,9 +3470,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 * We don't use a PG_TRY block here, so be careful not to throw error
 * without releasing the PGresult.
 */
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 PQclear(res);

 /* This action shows that the prepare has been done. */
@@ -3299,9 +3603,9 @@ execute_dml_stmt(ForeignScanState *node)
 * the desired result. This allows us to avoid assuming that the remote
 * server has the same OIDs we do for the parameters' types.
 */
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);

 /*
 * Get the result, and check for success.
@@ -3309,10 +3613,10 @@ execute_dml_stmt(ForeignScanState *node)
 * We don't use a PG_TRY block here, so be careful not to throw error
 * without releasing the PGresult.
 */
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 if (PQresultStatus(dmstate->result) !=
 (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 dmstate->query);

 /* Get the number of rows affected. */
@@ -4582,6 +4886,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 /* XXX Consider parameterized paths for the join relation */
}

+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when the node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+ void *caller_data, bool reinit)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* If the caller didn't reinit, this event is already in the event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ AddWaitEventToSet(wes,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, caller_data);
+ return true;
+ }
+
+ return false;
+}
+
+/*
 * Assess whether the aggregation, grouping and having operations can be pushed
 * down to the foreign server. As a side effect, save information we obtain in
@@ -4946,7 +5286,7 @@ make_tuple_from_result_row(PGresult *res,
 PgFdwScanState *fdw_sstate;

 Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
 tupdesc = fdw_sstate->tupdesc;
 }

diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 788b003..41ac1d2 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
 UserMapping *user; /* only set in use_remote_estimate mode */

 int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */

 /*
 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);

/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index ddfec79..56aae91 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1535,25 +1535,25 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');

-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;

UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';

-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;

UPDATE b SET aa = 'new';

-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;

UPDATE a SET aa = 'newtoo';

-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;

@@ -1589,12 +1589,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);

explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;

explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;

-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1653,8 +1653,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;

drop table foo cascade;
drop table bar cascade;
-- 
2.9.2

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
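[ For readers following the postgres_fdw changes above, here is a minimal,
self-contained C sketch of the connection-sharing protocol they implement:
one scan node owns the shared connection at a time, later requesters append
themselves to a waiter list (with a tail shortcut kept by the owner), and
ownership passes down the list as each result is absorbed. The names below
(Conn, ScanNode, request, complete) are invented for illustration only; the
patch keeps the equivalent state in PgFdwConnpriv->current_owner and in the
waiter/last_waiter fields of PgFdwScanState. ]

#include <stdio.h>

typedef struct ScanNode
{
	const char *name;
	struct ScanNode *waiter;		/* next node waiting for the connection */
	struct ScanNode *last_waiter;	/* tail of the waiter list, kept by owner */
} ScanNode;

typedef struct Conn
{
	ScanNode   *current_owner;		/* NULL when the connection is idle */
} Conn;

/* Request the connection: become the owner, or join the waiter list. */
static void
request(Conn *conn, ScanNode *node)
{
	if (conn->current_owner == NULL)
	{
		conn->current_owner = node;
		node->last_waiter = node;	/* the owner is its own list tail */
		printf("%s: owns the connection, query sent\n", node->name);
	}
	else
	{
		ScanNode   *owner = conn->current_owner;

		/* append via the tail shortcut; no list scan needed */
		owner->last_waiter->waiter = node;
		owner->last_waiter = node;
		printf("%s: queued behind %s\n", node->name, owner->name);
	}
}

/* Owner absorbs its result and hands the connection to the next waiter. */
static void
complete(Conn *conn)
{
	ScanNode   *owner = conn->current_owner;
	ScanNode   *next = owner->waiter;

	printf("%s: result absorbed\n", owner->name);
	owner->waiter = NULL;
	conn->current_owner = next;
	if (next)
	{
		next->last_waiter = owner->last_waiter;	/* inherit the tail shortcut */
		printf("%s: now owns the connection\n", next->name);
	}
}

int
main(void)
{
	Conn		conn = {NULL};
	ScanNode	a = {"scan_a"}, b = {"scan_b"}, c = {"scan_c"};

	request(&conn, &a);
	request(&conn, &b);
	request(&conn, &c);
	complete(&conn);			/* a done; b becomes owner */
	complete(&conn);			/* b done; c becomes owner */
	complete(&conn);			/* c done; connection idle again */
	return 0;
}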
Hello,

At Fri, 20 Oct 2017 17:37:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20171020.173707.12913619.horiguchi.kyotaro@lab.ntt.co.jp>
> The attached PoC patch theoretically has no impact on the normal
> code paths and just brings gain in async cases.

The parallel append patch just committed conflicted with this one, so
attached is a version rebased onto current HEAD. The result of a
concise performance test follows.

                             patched(ms)  unpatched(ms)  gain(%)
 A: simple table scan     :      3562.32        3444.81     -3.4
 B: local partitioning    :      1451.25        1604.38      9.5
 C: single remote table   :      8818.92        9297.76      5.1
 D: sharding (single con) :      5966.14        6646.73     10.2
 E: sharding (multi con)  :      1802.25        6515.49     72.3

> A and B are degradation checks, which are expected to show no
> degradation. C is the gain only by postgres_fdw's command
> presending on a remote table. D is the gain of sharding on a
> connection. The number of partitions/shards is 4. E is the gain
> using dedicate connection per shard.

Test A is accelerated by parallel sequential scan, and test B by the
newly committed parallel append. Comparing A and B, I doubt that the
degradation is stably measurable, at least in my environment, but I
believe there is no degradation theoretically. Tests C to E still show
a clear gain.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From b1aff3362b983975003d8a60f9b3593cb2fa62fc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change allows the creator of a WaitEventSet to specify a
resource owner that will own it.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index fc15181..7c4077a 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif

- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 NULL, NULL);
 AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 4eb6e83..e6fc3dd 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"

 /*
 * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 int nevents; /* number of registered events */
 int nevents_space; /* maximum number of events in this set */

+ ResourceOwner resowner; /* Resource owner */
+
 /*
 * Array, of nevents_space length, storing the definition of events this
 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, int ret = 0; int rc; WaitEvent event; - WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3); + WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3); if (wakeEvents & WL_TIMEOUT) Assert(timeout >= 0); @@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch) * WaitEventSetWait(). */ WaitEventSet * -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) { WaitEventSet *set; char *data; Size sz = 0; + if (res) + ResourceOwnerEnlargeWESs(res); + /* * Use MAXALIGN size/alignment to guarantee that later uses of memory are * aligned correctly. E.g. epoll_event might need 8 byte alignment on some @@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents) StaticAssertStmt(WSA_INVALID_EVENT == NULL, ""); #endif + /* Register this wait event set if requested */ + set->resowner = res; + if (res) + ResourceOwnerRememberWES(set->resowner, set); + return set; } @@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set) } #endif + if (set->resowner != NULL) + ResourceOwnerForgetWES(set->resowner, set); + pfree(set); } diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c index b4b7d28..182f759 100644 --- a/src/backend/storage/lmgr/condition_variable.c +++ b/src/backend/storage/lmgr/condition_variable.c @@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv) /* Create a reusable WaitEventSet. */ if (cv_wait_event_set == NULL) { - cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1); + cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1); AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL); } diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c index 4c35ccf..e00e39c 100644 --- a/src/backend/utils/resowner/resowner.c +++ b/src/backend/utils/resowner/resowner.c @@ -124,6 +124,7 @@ typedef struct ResourceOwnerData ResourceArray snapshotarr; /* snapshot references */ ResourceArray filearr; /* open temporary files */ ResourceArray dsmarr; /* dynamic shmem segments */ + ResourceArray wesarr; /* wait event sets */ /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. 
 */
 int nlocks; /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);


/*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));

 return owner;
}
@@ -538,6 +541,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 PrintDSMLeakWarning(res);
 dsm_detach(res);
 }
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
 }
 else if (phase == RESOURCE_RELEASE_LOCKS)
 {
@@ -685,6 +698,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 Assert(owner->snapshotarr.nitems == 0);
 Assert(owner->filearr.nitems == 0);
 Assert(owner->dsmarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
 Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);

 /*
@@ -711,6 +725,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 ResourceArrayFree(&(owner->snapshotarr));
 ResourceArrayFree(&(owner->filearr));
 ResourceArrayFree(&(owner->dsmarr));
+ ResourceArrayFree(&(owner->wesarr));

 pfree(owner);
}
@@ -1253,3 +1268,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
 elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 dsm_segment_handle(seg));
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event
+ * set, so use its pointer instead.
+ */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event
+ * set, so use its pointer instead.
+ */ + elog(WARNING, "wait event set leak: %p still referenced", + events); +} diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h index a43193c..997ee8d 100644 --- a/src/include/storage/latch.h +++ b/src/include/storage/latch.h @@ -101,6 +101,7 @@ #define LATCH_H #include <signal.h> +#include "utils/resowner.h" /* * Latch structure should be treated as opaque and only accessed through @@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch); extern void SetLatch(volatile Latch *latch); extern void ResetLatch(volatile Latch *latch); -extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents); +extern WaitEventSet *CreateWaitEventSet(MemoryContext context, + ResourceOwner res, int nevents); extern void FreeWaitEventSet(WaitEventSet *set); extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch, void *user_data); diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h index 2420b65..70b0bb9 100644 --- a/src/include/utils/resowner_private.h +++ b/src/include/utils/resowner_private.h @@ -18,6 +18,7 @@ #include "storage/dsm.h" #include "storage/fd.h" +#include "storage/latch.h" #include "storage/lock.h" #include "utils/catcache.h" #include "utils/plancache.h" @@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner, extern void ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *); +/* support for wait event set management */ +extern void ResourceOwnerEnlargeWESs(ResourceOwner owner); +extern void ResourceOwnerRememberWES(ResourceOwner owner, + WaitEventSet *); +extern void ResourceOwnerForgetWES(ResourceOwner owner, + WaitEventSet *); + #endif /* RESOWNER_PRIVATE_H */ -- 2.9.2 From 9c1273a4868bed5eb0991f842296cb89c10470bc Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 19 Oct 2017 17:23:51 +0900 Subject: [PATCH 2/3] core side modification --- src/backend/executor/Makefile | 2 +- src/backend/executor/execAsync.c | 110 ++++++++++++++ src/backend/executor/nodeAppend.c | 247 +++++++++++++++++++++++++++----- src/backend/executor/nodeForeignscan.c | 22 ++- src/backend/optimizer/plan/createplan.c | 62 +++++++- src/backend/postmaster/pgstat.c | 3 + src/include/executor/execAsync.h | 23 +++ src/include/executor/executor.h | 1 + src/include/executor/nodeForeignscan.h | 3 + src/include/foreign/fdwapi.h | 11 ++ src/include/nodes/execnodes.h | 18 ++- src/include/nodes/plannodes.h | 2 + src/include/pgstat.h | 3 +- 13 files changed, 462 insertions(+), 45 deletions(-) create mode 100644 src/backend/executor/execAsync.c create mode 100644 src/include/executor/execAsync.h diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile index cc09895..8ad2adf 100644 --- a/src/backend/executor/Makefile +++ b/src/backend/executor/Makefile @@ -12,7 +12,7 @@ subdir = src/backend/executor top_builddir = ../../.. 
include $(top_builddir)/src/Makefile.global -OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \ +OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \ execGrouping.o execIndexing.o execJunk.o \ execMain.o execParallel.o execPartition.o execProcnode.o \ execReplication.o execScan.o execSRF.o execTuples.o \ diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c new file mode 100644 index 0000000..f7daed7 --- /dev/null +++ b/src/backend/executor/execAsync.c @@ -0,0 +1,110 @@ +/*------------------------------------------------------------------------- + * + * execAsync.c + * Support routines for asynchronous execution. + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/executor/execAsync.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "executor/execAsync.h" +#include "executor/nodeAppend.h" +#include "executor/nodeForeignscan.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "utils/memutils.h" +#include "utils/resowner.h" + +void ExecAsyncSetState(PlanState *pstate, AsyncState status) +{ + pstate->asyncstate = status; +} + +bool +ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node, + void *data, bool reinit) +{ + switch (nodeTag(node)) + { + case T_ForeignScanState: + return ExecForeignAsyncConfigureWait((ForeignScanState *)node, + wes, data, reinit); + break; + default: + elog(ERROR, "unrecognized node type: %d", + (int) nodeTag(node)); + } +} + +#define EVENT_BUFFER_SIZE 16 + +Bitmapset * +ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout) +{ + static int *refind = NULL; + static int refindsize = 0; + WaitEventSet *wes; + WaitEvent occurred_event[EVENT_BUFFER_SIZE]; + int noccurred = 0; + Bitmapset *fired_events = NULL; + int i; + int n; + + n = bms_num_members(waitnodes); + wes = CreateWaitEventSet(TopTransactionContext, + TopTransactionResourceOwner, n); + if (refindsize < n) + { + if (refindsize == 0) + refindsize = EVENT_BUFFER_SIZE; /* XXX */ + while (refindsize < n) + refindsize *= 2; + if (refind) + refind = (int *) repalloc(refind, refindsize * sizeof(int)); + else + refind = (int *) palloc(refindsize * sizeof(int)); + } + + n = 0; + for (i = bms_next_member(waitnodes, -1) ; i >= 0 ; + i = bms_next_member(waitnodes, i)) + { + refind[i] = i; + if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true)) + n++; + } + + if (n == 0) + { + FreeWaitEventSet(wes); + return NULL; + } + + noccurred = WaitEventSetWait(wes, timeout, occurred_event, + EVENT_BUFFER_SIZE, + WAIT_EVENT_ASYNC_WAIT); + FreeWaitEventSet(wes); + if (noccurred == 0) + return NULL; + + for (i = 0 ; i < noccurred ; i++) + { + WaitEvent *w = &occurred_event[i]; + + if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0) + { + int n = *(int*)w->user_data; + + fired_events = bms_add_member(fired_events, n); + } + } + + return fired_events; +} diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c index 0e93713..f21ab36 100644 --- a/src/backend/executor/nodeAppend.c +++ b/src/backend/executor/nodeAppend.c @@ -59,6 +59,7 @@ #include "executor/execdebug.h" #include "executor/nodeAppend.h" +#include "executor/execAsync.h" #include "miscadmin.h" /* Shared state for parallel-aware Append. 
*/ @@ -79,6 +80,7 @@ struct ParallelAppendState #define INVALID_SUBPLAN_INDEX -1 static TupleTableSlot *ExecAppend(PlanState *pstate); +static TupleTableSlot *ExecAppendAsync(PlanState *pstate); static bool choose_next_subplan_locally(AppendState *node); static bool choose_next_subplan_for_leader(AppendState *node); static bool choose_next_subplan_for_worker(AppendState *node); @@ -104,7 +106,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags) ListCell *lc; /* check for unsupported flags */ - Assert(!(eflags & EXEC_FLAG_MARK)); + Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC))); /* * Lock the non-leaf tables in the partition tree controlled by this node. @@ -127,6 +129,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags) appendstate->ps.ExecProcNode = ExecAppend; appendstate->appendplans = appendplanstates; appendstate->as_nplans = nplans; + appendstate->as_nasyncplans = node->nasyncplans; + appendstate->as_syncdone = (node->nasyncplans == nplans); + appendstate->as_asyncresult = (TupleTableSlot **) + palloc0(node->nasyncplans * sizeof(TupleTableSlot *)); + + /* Choose async version of Exec function */ + if (appendstate->as_nasyncplans > 0) + appendstate->ps.ExecProcNode = ExecAppendAsync; + + /* initially, all async requests need a request */ + for (i = 0; i < appendstate->as_nasyncplans; ++i) + appendstate->as_needrequest = + bms_add_member(appendstate->as_needrequest, i); /* * Miscellaneous initialization @@ -149,27 +164,48 @@ ExecInitAppend(Append *node, EState *estate, int eflags) foreach(lc, node->appendplans) { Plan *initNode = (Plan *) lfirst(lc); + int sub_eflags = eflags; + + if (i < appendstate->as_nasyncplans) + sub_eflags |= EXEC_FLAG_ASYNC; - appendplanstates[i] = ExecInitNode(initNode, estate, eflags); + appendplanstates[i] = ExecInitNode(initNode, estate, sub_eflags); i++; } + /* if there's any async-capable subnode, use async-aware routine */ + if (appendstate->as_nasyncplans) + appendstate->ps.ExecProcNode = ExecAppendAsync; + /* * initialize output tuple type */ ExecAssignResultTypeFromTL(&appendstate->ps); appendstate->ps.ps_ProjInfo = NULL; - /* - * Parallel-aware append plans must choose the first subplan to execute by - * looking at shared memory, but non-parallel-aware append plans can - * always start with the first subplan. - */ - appendstate->as_whichplan = - appendstate->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0; + if (appendstate->ps.plan->parallel_aware) + { + /* + * Parallel-aware append plans must choose the first subplan to + * execute by looking at shared memory, but non-parallel-aware append + * plans can always start with the first subplan. + */ - /* If parallel-aware, this will be overridden later. */ - appendstate->choose_next_subplan = choose_next_subplan_locally; + appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX; + + /* If parallel-aware, this will be overridden later. */ + appendstate->choose_next_subplan = choose_next_subplan_locally; + } + else + { + appendstate->as_whichsyncplan = 0; + + /* + * initialize to scan first synchronous subplan + */ + appendstate->as_whichsyncplan = appendstate->as_nasyncplans; + appendstate->choose_next_subplan = choose_next_subplan_locally; + } return appendstate; } @@ -186,10 +222,12 @@ ExecAppend(PlanState *pstate) AppendState *node = castNode(AppendState, pstate); /* If no subplan has been chosen, we must choose one before proceeding. 
*/ - if (node->as_whichplan == INVALID_SUBPLAN_INDEX && + if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX && !node->choose_next_subplan(node)) return ExecClearTuple(node->ps.ps_ResultTupleSlot); + Assert(node->as_nasyncplans == 0); + for (;;) { PlanState *subnode; @@ -200,8 +238,9 @@ ExecAppend(PlanState *pstate) /* * figure out which subplan we are currently processing */ - Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans); - subnode = node->appendplans[node->as_whichplan]; + Assert(node->as_whichsyncplan >= 0 && + node->as_whichsyncplan < node->as_nplans); + subnode = node->appendplans[node->as_whichsyncplan]; /* * get a tuple from the subplan @@ -224,6 +263,137 @@ ExecAppend(PlanState *pstate) } } +static TupleTableSlot * +ExecAppendAsync(PlanState *pstate) +{ + AppendState *node = castNode(AppendState, pstate); + Bitmapset *needrequest; + int i; + + Assert(node->as_nasyncplans > 0); + + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + needrequest = node->as_needrequest; + node->as_needrequest = NULL; + while ((i = bms_first_member(needrequest)) >= 0) + { + TupleTableSlot *slot; + PlanState *subnode = node->appendplans[i]; + + slot = ExecProcNode(subnode); + if (subnode->asyncstate == AS_AVAILABLE) + { + if (!TupIsNull(slot)) + { + node->as_asyncresult[node->as_nasyncresult++] = slot; + node->as_needrequest = bms_add_member(node->as_needrequest, i); + } + } + else + node->as_pending_async = bms_add_member(node->as_pending_async, i); + } + bms_free(needrequest); + + for (;;) + { + TupleTableSlot *result; + + /* return now if a result is available */ + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + while (!bms_is_empty(node->as_pending_async)) + { + long timeout = node->as_syncdone ? -1 : 0; + Bitmapset *fired; + int i; + + fired = ExecAsyncEventWait(node->appendplans, node->as_pending_async, + timeout); + while ((i = bms_first_member(fired)) >= 0) + { + TupleTableSlot *slot; + PlanState *subnode = node->appendplans[i]; + slot = ExecProcNode(subnode); + if (subnode->asyncstate == AS_AVAILABLE) + { + if (!TupIsNull(slot)) + { + node->as_asyncresult[node->as_nasyncresult++] = slot; + node->as_needrequest = + bms_add_member(node->as_needrequest, i); + } + node->as_pending_async = + bms_del_member(node->as_pending_async, i); + } + } + bms_free(fired); + + /* return now if a result is available */ + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + if (!node->as_syncdone) + break; + } + + /* + * If there is no asynchronous activity still pending and the + * synchronous activity is also complete, we're totally done scanning + * this node. Otherwise, we're done with the asynchronous stuff but + * must continue scanning the synchronous children. + */ + if (node->as_syncdone) + { + Assert(bms_is_empty(node->as_pending_async)); + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + + /* + * get a tuple from the subplan + */ + result = ExecProcNode(node->appendplans[node->as_whichsyncplan]); + + if (!TupIsNull(result)) + { + /* + * If the subplan gave us something then return it as-is. We do + * NOT make use of the result slot that was set up in + * ExecInitAppend; there's no need for it. + */ + return result; + } + + /* + * Go on to the "next" subplan in the appropriate direction. 
If no + * more subplans, return the empty slot set up for us by + * ExecInitAppend, unless there are async plans we have yet to finish. + */ + if (!node->choose_next_subplan(node)) + { + node->as_syncdone = true; + if (bms_is_empty(node->as_pending_async)) + { + Assert(bms_is_empty(node->as_needrequest)); + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + } + + /* Else loop back and try to get a tuple from the new subplan */ + } +} + /* ---------------------------------------------------------------- * ExecEndAppend * @@ -257,6 +427,15 @@ ExecReScanAppend(AppendState *node) { int i; + /* Reset async state. */ + for (i = 0; i < node->as_nasyncplans; ++i) + { + ExecShutdownNode(node->appendplans[i]); + node->as_needrequest = bms_add_member(node->as_needrequest, i); + } + node->as_nasyncresult = 0; + node->as_syncdone = (node->as_nasyncplans == node->as_nplans); + for (i = 0; i < node->as_nplans; i++) { PlanState *subnode = node->appendplans[i]; @@ -276,7 +455,7 @@ ExecReScanAppend(AppendState *node) ExecReScan(subnode); } - node->as_whichplan = + node->as_whichsyncplan = node->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0; } @@ -365,7 +544,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt) static bool choose_next_subplan_locally(AppendState *node) { - int whichplan = node->as_whichplan; + int whichplan = node->as_whichsyncplan; /* We should never see INVALID_SUBPLAN_INDEX in this case. */ Assert(whichplan >= 0 && whichplan <= node->as_nplans); @@ -374,13 +553,13 @@ choose_next_subplan_locally(AppendState *node) { if (whichplan >= node->as_nplans - 1) return false; - node->as_whichplan++; + node->as_whichsyncplan++; } else { if (whichplan <= 0) return false; - node->as_whichplan--; + node->as_whichsyncplan--; } return true; @@ -405,33 +584,33 @@ choose_next_subplan_for_leader(AppendState *node) LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE); - if (node->as_whichplan != INVALID_SUBPLAN_INDEX) + if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX) { /* Mark just-completed subplan as finished. */ - node->as_pstate->pa_finished[node->as_whichplan] = true; + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; } else { /* Start with last subplan. */ - node->as_whichplan = node->as_nplans - 1; + node->as_whichsyncplan = node->as_nplans - 1; } /* Loop until we find a subplan to execute. */ - while (pstate->pa_finished[node->as_whichplan]) + while (pstate->pa_finished[node->as_whichsyncplan]) { - if (node->as_whichplan == 0) + if (node->as_whichsyncplan == 0) { pstate->pa_next_plan = INVALID_SUBPLAN_INDEX; - node->as_whichplan = INVALID_SUBPLAN_INDEX; + node->as_whichsyncplan = INVALID_SUBPLAN_INDEX; LWLockRelease(&pstate->pa_lock); return false; } - node->as_whichplan--; + node->as_whichsyncplan--; } /* If non-partial, immediately mark as finished. */ - if (node->as_whichplan < append->first_partial_plan) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan < append->first_partial_plan) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; LWLockRelease(&pstate->pa_lock); @@ -464,8 +643,8 @@ choose_next_subplan_for_worker(AppendState *node) LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE); /* Mark just-completed subplan as finished. 
*/ - if (node->as_whichplan != INVALID_SUBPLAN_INDEX) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; /* If all the plans are already done, we have nothing to do */ if (pstate->pa_next_plan == INVALID_SUBPLAN_INDEX) @@ -490,10 +669,10 @@ choose_next_subplan_for_worker(AppendState *node) else { /* At last plan, no partial plans, arrange to bail out. */ - pstate->pa_next_plan = node->as_whichplan; + pstate->pa_next_plan = node->as_whichsyncplan; } - if (pstate->pa_next_plan == node->as_whichplan) + if (pstate->pa_next_plan == node->as_whichsyncplan) { /* We've tried everything! */ pstate->pa_next_plan = INVALID_SUBPLAN_INDEX; @@ -503,7 +682,7 @@ choose_next_subplan_for_worker(AppendState *node) } /* Pick the plan we found, and advance pa_next_plan one more time. */ - node->as_whichplan = pstate->pa_next_plan++; + node->as_whichsyncplan = pstate->pa_next_plan++; if (pstate->pa_next_plan >= node->as_nplans) { if (append->first_partial_plan < node->as_nplans) @@ -519,8 +698,8 @@ choose_next_subplan_for_worker(AppendState *node) } /* If non-partial, immediately mark as finished. */ - if (node->as_whichplan < append->first_partial_plan) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan < append->first_partial_plan) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; LWLockRelease(&pstate->pa_lock); diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c index dc6cfcf..afc8a58 100644 --- a/src/backend/executor/nodeForeignscan.c +++ b/src/backend/executor/nodeForeignscan.c @@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate) (ExecScanRecheckMtd) ForeignRecheck); } - /* ---------------------------------------------------------------- * ExecInitForeignScan * ---------------------------------------------------------------- @@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags) scanstate->ss.ps.plan = (Plan *) node; scanstate->ss.ps.state = estate; scanstate->ss.ps.ExecProcNode = ExecForeignScan; + scanstate->ss.ps.asyncstate = AS_AVAILABLE; + + if ((eflags & EXEC_FLAG_ASYNC) != 0) + scanstate->fs_async = true; /* * Miscellaneous initialization @@ -389,3 +392,20 @@ ExecShutdownForeignScan(ForeignScanState *node) if (fdwroutine->ShutdownForeignScan) fdwroutine->ShutdownForeignScan(node); } + +/* ---------------------------------------------------------------- + * ExecAsyncForeignScanConfigureWait + * + * In async mode, configure for a wait + * ---------------------------------------------------------------- + */ +bool +ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes, + void *caller_data, bool reinit) +{ + FdwRoutine *fdwroutine = node->fdwroutine; + + Assert(fdwroutine->ForeignAsyncConfigureWait != NULL); + return fdwroutine->ForeignAsyncConfigureWait(node, wes, + caller_data, reinit); +} diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index f6c83d0..402db1e 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -204,7 +204,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual, Index scanrelid, int wtParam); static Append *make_append(List *appendplans, int first_partial_plan, - List *tlist, List *partitioned_rels); + int nasyncplans, int 
referent,
+ List *tlist, List *partitioned_rels);
static RecursiveUnion *make_recursive_union(List *tlist,
 Plan *lefttree,
 Plan *righttree,
@@ -284,6 +285,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);


/*
@@ -1014,8 +1016,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
 Append *plan;
 List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
 ListCell *subpaths;
+ int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;

 /*
 * The subpaths list could be empty, if every child was proven empty by
@@ -1050,7 +1056,21 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 /* Must insist that all children return the same tlist */
 subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);

- subplans = lappend(subplans, subplan);
+ /*
+ * Classify as async-capable or not. If we have decided to run the
+ * children in parallel, we cannot run any of them asynchronously.
+ */
+ if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ if (first)
+ referent_is_sync = false;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
+
+ first = false;
 }

 /*
@@ -1060,8 +1080,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 * parent-rel Vars it'll be asked to emit.
 */

- plan = make_append(subplans, best_path->first_partial_path,
- tlist, best_path->partitioned_rels);
+ plan = make_append(list_concat(asyncplans, syncplans),
+ best_path->first_partial_path, nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist,
+ best_path->partitioned_rels);

 copy_generic_path_info(&plan->plan, (Path *) best_path);

@@ -5296,8 +5318,8 @@ make_foreignscan(List *qptlist,
}

static Append *
-make_append(List *appendplans, int first_partial_plan,
- List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int first_partial_plan, int nasyncplans,
+ int referent, List *tlist, List *partitioned_rels)
{
 Append *node = makeNode(Append);
 Plan *plan = &node->plan;
@@ -5309,6 +5331,8 @@ make_append(List *appendplans, int first_partial_plan,
 node->partitioned_rels = partitioned_rels;
 node->appendplans = appendplans;
 node->first_partial_plan = first_partial_plan;
+ node->nasyncplans = nasyncplans;
+ node->referent = referent;

 return node;
}
@@ -6646,3 +6670,27 @@ is_projection_capable_plan(Plan *plan)
 }
 return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */ +static bool +is_async_capable_path(Path *path) +{ + switch (nodeTag(path)) + { + case T_ForeignPath: + { + FdwRoutine *fdwroutine = path->parent->fdwroutine; + + Assert(fdwroutine != NULL); + if (fdwroutine->IsForeignPathAsyncCapable != NULL && + fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path)) + return true; + } + default: + break; + } + return false; +} diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 5c256ff..09ea33b 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -3628,6 +3628,9 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_SYNC_REP: event_name = "SyncRep"; break; + case WAIT_EVENT_ASYNC_WAIT: + event_name = "AsyncExecWait"; + break; /* no default case, so that compiler will warn */ } diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h new file mode 100644 index 0000000..5fd67d9 --- /dev/null +++ b/src/include/executor/execAsync.h @@ -0,0 +1,23 @@ +/*-------------------------------------------------------------------- + * execAsync.c + * Support functions for asynchronous query execution + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/executor/execAsync.c + *-------------------------------------------------------------------- + */ +#ifndef EXECASYNC_H +#define EXECASYNC_H + +#include "nodes/execnodes.h" +#include "storage/latch.h" + +extern void ExecAsyncSetState(PlanState *pstate, AsyncState status); +extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node, + void *data, bool reinit); +extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, + long timeout); +#endif /* EXECASYNC_H */ diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index b5578f5..bd622c9 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -63,6 +63,7 @@ #define EXEC_FLAG_WITH_OIDS 0x0020 /* force OIDs in returned tuples */ #define EXEC_FLAG_WITHOUT_OIDS 0x0040 /* force no OIDs in returned tuples */ #define EXEC_FLAG_WITH_NO_DATA 0x0080 /* rel scannability doesn't matter */ +#define EXEC_FLAG_ASYNC 0x0100 /* request async execution */ /* Hook for plugins to get control in ExecutorStart() */ diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h index 152abf0..1d95e39 100644 --- a/src/include/executor/nodeForeignscan.h +++ b/src/include/executor/nodeForeignscan.h @@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node, extern void ExecForeignScanInitializeWorker(ForeignScanState *node, ParallelWorkerContext *pwcxt); extern void ExecShutdownForeignScan(ForeignScanState *node); +extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, bool reinit); #endif /* NODEFOREIGNSCAN_H */ diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h index 04e43cc..566236b 100644 --- a/src/include/foreign/fdwapi.h +++ b/src/include/foreign/fdwapi.h @@ -161,6 +161,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root, typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root, List *fdw_private, RelOptInfo *child_rel); +typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path); +typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node, + WaitEventSet *wes, + void 
*caller_data, + bool reinit); /* * FdwRoutine is the struct returned by a foreign-data wrapper's handler @@ -182,6 +187,7 @@ typedef struct FdwRoutine GetForeignPlan_function GetForeignPlan; BeginForeignScan_function BeginForeignScan; IterateForeignScan_function IterateForeignScan; + IterateForeignScan_function IterateForeignScanAsync; ReScanForeignScan_function ReScanForeignScan; EndForeignScan_function EndForeignScan; @@ -232,6 +238,11 @@ typedef struct FdwRoutine InitializeDSMForeignScan_function InitializeDSMForeignScan; ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan; InitializeWorkerForeignScan_function InitializeWorkerForeignScan; + + /* Support functions for asynchronous execution */ + IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable; + ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait; + ShutdownForeignScan_function ShutdownForeignScan; /* Support functions for path reparameterization. */ diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 1a35c5c..c049251 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -843,6 +843,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate); * abstract superclass for all PlanState-type nodes. * ---------------- */ +typedef enum AsyncState +{ + AS_AVAILABLE, + AS_WAITING +} AsyncState; + typedef struct PlanState { NodeTag type; @@ -883,6 +889,9 @@ typedef struct PlanState TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */ ExprContext *ps_ExprContext; /* node's expression-evaluation context */ ProjectionInfo *ps_ProjInfo; /* info for doing tuple projection */ + + AsyncState asyncstate; + int32 padding; /* to keep alignment of derived types */ } PlanState; /* ---------------- @@ -1012,10 +1021,16 @@ struct AppendState PlanState ps; /* its first field is NodeTag */ PlanState **appendplans; /* array of PlanStates for my inputs */ int as_nplans; - int as_whichplan; + int as_nasyncplans; /* # of async-capable children */ ParallelAppendState *as_pstate; /* parallel coordination info */ + int as_whichsyncplan; /* which sync plan is being executed */ Size pstate_len; /* size of parallel coordination info */ bool (*choose_next_subplan) (AppendState *); + bool as_syncdone; /* all synchronous plans done? 
 */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ Bitmapset *as_pending_async; /* pending async plans */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
};

/* ----------------
@@ -1566,6 +1581,7 @@ typedef struct ForeignScanState
 Size pscan_len; /* size of parallel coordination information */
 /* use struct pointer to avoid including fdwapi.h here */
 struct FdwRoutine *fdwroutine;
+ bool fs_async;
 void *fdw_state; /* foreign-data wrapper can keep state here */
} ForeignScanState;

diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 02fb366..a6df261 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -249,6 +249,8 @@ typedef struct Append
 List *partitioned_rels;
 List *appendplans;
 int first_partial_plan;
+ int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;

/* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..fe9d39c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -816,7 +816,8 @@ typedef enum
 WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 WAIT_EVENT_REPLICATION_SLOT_DROP,
 WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;

/* ----------
-- 
2.9.2

From d0882fbc09fce447e642278292b70e4a6b73575e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c              |  26 ++
 contrib/postgres_fdw/expected/postgres_fdw.out | 128 ++++---
 contrib/postgres_fdw/postgres_fdw.c            | 484 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 522 insertions(+), 138 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 4fbf043..646085f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
 bool invalidated; /* true if reconnect is pending */
 uint32 server_hashvalue; /* hash value of foreign server OID */
 uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;

/*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 entry->conn, server->servername, user->umid, user->userid);
+ entry->storage = NULL;
 }

 /*
@@ -216,6 +218,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}

/*
+ * Returns the connection-specific storage for this user. Allocates it with
+ * size initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ bool found;
+ ConnCacheEntry *entry;
+ ConnCacheKey key;
+
+ key = user->umid;
+ entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+ Assert(found);
+
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
+/*
 * Connect to remote server using specified server and user mapping properties.
*/ static PGconn * diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 683d641..3b4eefa 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -6514,7 +6514,7 @@ INSERT INTO a(aa) VALUES('aaaaa'); INSERT INTO b(aa) VALUES('bbb'); INSERT INTO b(aa) VALUES('bbbb'); INSERT INTO b(aa) VALUES('bbbbb'); -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+------- a | aaa @@ -6542,7 +6542,7 @@ SELECT tableoid::regclass, * FROM ONLY a; (3 rows) UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | aaa @@ -6570,7 +6570,7 @@ SELECT tableoid::regclass, * FROM ONLY a; (3 rows) UPDATE b SET aa = 'new'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | aaa @@ -6598,7 +6598,7 @@ SELECT tableoid::regclass, * FROM ONLY a; (3 rows) UPDATE a SET aa = 'newtoo'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | newtoo @@ -6664,35 +6664,40 @@ insert into bar2 values(3,33,33); insert into bar2 values(4,44,44); insert into bar2 values(7,77,77); explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for update; - QUERY PLAN ----------------------------------------------------------------------------------------------- +select * from bar where f1 in (select f1 from foo) order by 1 for update; + QUERY PLAN +----------------------------------------------------------------------------------------------------------------- LockRows Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid - -> Hash Join + -> Merge Join Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid Inner Unique: true - Hash Cond: (bar.f1 = foo.f1) - -> Append - -> Seq Scan on public.bar + Merge Cond: (bar.f1 = foo.f1) + -> Merge Append + Sort Key: bar.f1 + -> Sort Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid + Sort Key: bar.f1 + -> Seq Scan on public.bar + Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid -> Foreign Scan on public.bar2 Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid - Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE - -> Hash + Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE + -> Sort Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Sort Key: foo.f1 -> HashAggregate Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -(23 rows) + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 +(28 rows) -select * from bar where f1 in (select f1 from foo) for update; +select * from bar where f1 in (select f1 from foo) order by 1 for update; f1 | f2 ----+---- 1 | 11 @@ -6702,35 +6707,40 @@ select * from bar where f1 in (select f1 from foo) for update; (4 rows) explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for share; - QUERY PLAN ----------------------------------------------------------------------------------------------- +select * from 
bar where f1 in (select f1 from foo) order by 1 for share; + QUERY PLAN +---------------------------------------------------------------------------------------------------------------- LockRows Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid - -> Hash Join + -> Merge Join Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid Inner Unique: true - Hash Cond: (bar.f1 = foo.f1) - -> Append - -> Seq Scan on public.bar + Merge Cond: (bar.f1 = foo.f1) + -> Merge Append + Sort Key: bar.f1 + -> Sort Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid + Sort Key: bar.f1 + -> Seq Scan on public.bar + Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid -> Foreign Scan on public.bar2 Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid - Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE - -> Hash + Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE + -> Sort Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Sort Key: foo.f1 -> HashAggregate Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -(23 rows) + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 +(28 rows) -select * from bar where f1 in (select f1 from foo) for share; +select * from bar where f1 in (select f1 from foo) order by 1 for share; f1 | f2 ----+---- 1 | 11 @@ -6760,11 +6770,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Hash Join Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid Inner Unique: true @@ -6778,11 +6788,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 (39 rows) update bar set f2 = f2 + 100 where f1 in (select f1 from foo); @@ -6813,16 +6823,16 @@ where bar.f1 = ss.f1; Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1)) Hash Cond: (foo.f1 = bar.f1) -> Append - -> Seq Scan on public.foo - Output: ROW(foo.f1), foo.f1 -> Foreign Scan on public.foo2 Output: ROW(foo2.f1), foo2.f1 Remote SQL: SELECT f1 FROM public.loct1 - -> Seq Scan on public.foo foo_1 - Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) -> Foreign Scan on public.foo2 foo2_1 Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) Remote SQL: SELECT f1 FROM public.loct1 + -> Seq Scan on public.foo + Output: ROW(foo.f1), foo.f1 + -> Seq Scan on public.foo foo_1 + Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) -> Hash Output: bar.f1, bar.f2, bar.ctid -> Seq Scan on public.bar @@ -6840,16 +6850,16 @@ where bar.f1 = ss.f1; Output: (ROW(foo.f1)), foo.f1 Sort Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: 
ROW(foo.f1), foo.f1 -> Foreign Scan on public.foo2 Output: ROW(foo2.f1), foo2.f1 Remote SQL: SELECT f1 FROM public.loct1 - -> Seq Scan on public.foo foo_1 - Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) -> Foreign Scan on public.foo2 foo2_1 Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) Remote SQL: SELECT f1 FROM public.loct1 + -> Seq Scan on public.foo + Output: ROW(foo.f1), foo.f1 + -> Seq Scan on public.foo foo_1 + Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) (45 rows) update bar set f2 = f2 + 100 @@ -7000,27 +7010,33 @@ delete from foo where f1 < 5 returning *; (5 rows) explain (verbose, costs off) -update bar set f2 = f2 + 100 returning *; - QUERY PLAN ------------------------------------------------------------------------------- - Update on public.bar - Output: bar.f1, bar.f2 - Update on public.bar - Foreign Update on public.bar2 - -> Seq Scan on public.bar - Output: bar.f1, (bar.f2 + 100), bar.ctid - -> Foreign Update on public.bar2 - Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2 -(8 rows) +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; + QUERY PLAN +-------------------------------------------------------------------------------------- + Sort + Output: u.f1, u.f2 + Sort Key: u.f1 + CTE u + -> Update on public.bar + Output: bar.f1, bar.f2 + Update on public.bar + Foreign Update on public.bar2 + -> Seq Scan on public.bar + Output: bar.f1, (bar.f2 + 100), bar.ctid + -> Foreign Update on public.bar2 + Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2 + -> CTE Scan on u + Output: u.f1, u.f2 +(14 rows) -update bar set f2 = f2 + 100 returning *; +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; f1 | f2 ----+----- 1 | 311 2 | 322 - 6 | 266 3 | 333 4 | 344 + 6 | 266 7 | 277 (6 rows) diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index fb65e2e..0688504 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -20,6 +20,8 @@ #include "commands/defrem.h" #include "commands/explain.h" #include "commands/vacuum.h" +#include "executor/execAsync.h" +#include "executor/nodeForeignscan.h" #include "foreign/fdwapi.h" #include "funcapi.h" #include "miscadmin.h" @@ -34,6 +36,7 @@ #include "optimizer/var.h" #include "optimizer/tlist.h" #include "parser/parsetree.h" +#include "pgstat.h" #include "utils/builtins.h" #include "utils/guc.h" #include "utils/lsyscache.h" @@ -53,6 +56,9 @@ PG_MODULE_MAGIC; /* If no remote estimates, assume a sort costs 20% extra */ #define DEFAULT_FDW_SORT_MULTIPLIER 1.2 +/* Retrieve the PgFdwScanState struct from a ForeignScanState */ +#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state) + /* * Indexes of FDW-private information stored in fdw_private lists. * @@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex }; /* + * Connection private area structure. + */ +typedef struct PgFdwConnpriv +{ + ForeignScanState *current_owner; /* The node currently running a query + * on this connection */ +} PgFdwConnpriv; + +/* Execution state base type */ +typedef struct PgFdwState +{ + PGconn *conn; /* connection for the scan */ + PgFdwConnpriv *connpriv; /* connection private memory */ +} PgFdwState; + +/* * Execution state of a foreign scan using postgres_fdw. */ typedef struct PgFdwScanState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table. NULL * for a foreign join scan.
*/ TupleDesc tupdesc; /* tuple descriptor of scan */ @@ -134,7 +157,7 @@ typedef struct PgFdwScanState List *retrieved_attrs; /* list of retrieved attribute numbers */ /* for remote query execution */ - PGconn *conn; /* connection for the scan */ + bool result_ready; unsigned int cursor_number; /* quasi-unique ID for my cursor */ bool cursor_exists; /* have we created the cursor? */ int numParams; /* number of parameters passed to query */ @@ -150,6 +173,13 @@ typedef struct PgFdwScanState /* batch-level state, for optimizing rewinds and avoiding useless fetch */ int fetch_ct_2; /* Min(# of fetches done, 2) */ bool eof_reached; /* true if last fetch reached EOF */ + bool run_async; /* true if run asynchronously */ + bool async_waiting; /* true if requesting the parent to wait */ + ForeignScanState *waiter; /* Next node to run a query among nodes + * sharing the same connection */ + ForeignScanState *last_waiter; /* A waiting node at the end of a waiting + * list. Maintained only by the current + * owner of the connection */ /* working memory contexts */ MemoryContext batch_cxt; /* context holding current batch of tuples */ @@ -163,11 +193,11 @@ typedef struct PgFdwScanState */ typedef struct PgFdwModifyState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table */ AttInMetadata *attinmeta; /* attribute datatype conversion metadata */ /* for remote query execution */ - PGconn *conn; /* connection for the scan */ char *p_name; /* name of prepared statement, if created */ /* extracted fdw_private data */ @@ -190,6 +220,7 @@ typedef struct PgFdwModifyState */ typedef struct PgFdwDirectModifyState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table */ AttInMetadata *attinmeta; /* attribute datatype conversion metadata */ @@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags); static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node); static void postgresReScanForeignScan(ForeignScanState *node); static void postgresEndForeignScan(ForeignScanState *node); +static void postgresShutdownForeignScan(ForeignScanState *node); static void postgresAddForeignUpdateTargets(Query *parsetree, RangeTblEntry *target_rte, Relation target_relation); @@ -348,6 +380,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root, UpperRelationKind stage, RelOptInfo *input_rel, RelOptInfo *output_rel); +static bool postgresIsForeignPathAsyncCapable(ForeignPath *path); +static bool postgresForeignAsyncConfigureWait(ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, bool reinit); /* * Helper functions @@ -368,7 +404,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel, EquivalenceClass *ec, EquivalenceMember *em, void *arg); static void create_cursor(ForeignScanState *node); -static void fetch_more_data(ForeignScanState *node); +static void request_more_data(ForeignScanState *node); +static void fetch_received_data(ForeignScanState *node); +static void vacate_connection(PgFdwState *fdwconn); +static void absorb_current_result(ForeignScanState *node); static void close_cursor(PGconn *conn, unsigned int cursor_number); static void prepare_foreign_modify(PgFdwModifyState *fmstate); static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate, @@ -438,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS) routine->IterateForeignScan = postgresIterateForeignScan; routine->ReScanForeignScan = postgresReScanForeignScan; 
routine->EndForeignScan = postgresEndForeignScan; + routine->ShutdownForeignScan = postgresShutdownForeignScan; /* Functions for updating foreign tables */ routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets; @@ -472,6 +512,10 @@ /* Support functions for upper relation push-down */ routine->GetForeignUpperPaths = postgresGetForeignUpperPaths; + /* Support functions for async execution */ + routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable; + routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait; + PG_RETURN_POINTER(routine); } @@ -1322,12 +1366,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags) * Get connection to the foreign server. Connection manager will * establish new connection if necessary. */ - fsstate->conn = GetConnection(user, false); + fsstate->s.conn = GetConnection(user, false); + fsstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); + fsstate->s.connpriv->current_owner = NULL; + fsstate->waiter = NULL; + fsstate->last_waiter = node; /* Assign a unique ID for my cursor */ - fsstate->cursor_number = GetCursorNumber(fsstate->conn); + fsstate->cursor_number = GetCursorNumber(fsstate->s.conn); fsstate->cursor_exists = false; + /* Initialize async execution status */ + fsstate->run_async = false; + fsstate->async_waiting = false; + /* Get private info created by planner functions. */ fsstate->query = strVal(list_nth(fsplan->fdw_private, FdwScanPrivateSelectSql)); @@ -1383,32 +1436,136 @@ static TupleTableSlot * postgresIterateForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); TupleTableSlot *slot = node->ss.ss_ScanTupleSlot; /* - * If this is the first call after Begin or ReScan, we need to create the - * cursor on the remote side. - */ - if (!fsstate->cursor_exists) - create_cursor(node); - - /* * Get some more tuples, if we've run out. */ if (fsstate->next_tuple >= fsstate->num_tuples) { - /* No point in another fetch if we already detected EOF, though. */ - if (!fsstate->eof_reached) - fetch_more_data(node); - /* If we didn't get any tuples, must be end of data. */ + ForeignScanState *next_conn_owner = node; + + /* This node has sent a query on this connection */ + if (fsstate->s.connpriv->current_owner == node) + { + /* Check if the result is available */ + if (PQisBusy(fsstate->s.conn)) + { + int rc = WaitLatchOrSocket(NULL, + WL_SOCKET_READABLE | WL_TIMEOUT, + PQsocket(fsstate->s.conn), 0, + WAIT_EVENT_ASYNC_WAIT); + if (node->fs_async && !(rc & WL_SOCKET_READABLE)) + { + /* + * This node is not ready yet. Tell the caller to wait. + */ + fsstate->result_ready = false; + node->ss.ps.asyncstate = AS_WAITING; + return ExecClearTuple(slot); + } + } + + Assert(fsstate->async_waiting); + fsstate->async_waiting = false; + fetch_received_data(node); + + /* + * If someone is waiting for this node on the same connection, let + * the first waiter be the next owner of this connection.
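+ * Waiters are chained through PgFdwScanState->waiter, and the current
+ * owner keeps a shortcut (last_waiter) to the tail of the chain, so a
+ * new waiter can be appended without walking the whole list.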
+ */ + if (fsstate->waiter) + { + PgFdwScanState *next_owner_state; + + next_conn_owner = fsstate->waiter; + next_owner_state = GetPgFdwScanState(next_conn_owner); + fsstate->waiter = NULL; + + /* + * only the current owner is responsible for maintaining the + * shortcut to the last waiter + */ + next_owner_state->last_waiter = fsstate->last_waiter; + + /* + * for simplicity, last_waiter points to the node itself when no + * one is waiting for it. + */ + fsstate->last_waiter = node; + } + } + else if (fsstate->s.connpriv->current_owner && + !GetPgFdwScanState(node)->eof_reached) + { + /* + * Someone else is holding this connection and we want this node to + * run later. Add myself to the tail of the waiters' list then + * return not-ready. To avoid scanning through the waiters' list, + * the current owner maintains the shortcut to the last waiter. + */ + PgFdwScanState *conn_owner_state = + GetPgFdwScanState(fsstate->s.connpriv->current_owner); + ForeignScanState *last_waiter = conn_owner_state->last_waiter; + PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter); + + last_waiter_state->waiter = node; + conn_owner_state->last_waiter = node; + + /* Register the node to the async-waiting node list */ + Assert(!GetPgFdwScanState(node)->async_waiting); + + GetPgFdwScanState(node)->async_waiting = true; + + fsstate->result_ready = fsstate->eof_reached; + node->ss.ps.asyncstate = + fsstate->result_ready ? AS_AVAILABLE : AS_WAITING; + return ExecClearTuple(slot); + } + + /* At this time no node is running on the connection */ + Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner + == NULL); + /* + * Send the next request for the next owner of this connection if + * needed. + */ + if (!GetPgFdwScanState(next_conn_owner)->eof_reached) + { + PgFdwScanState *next_owner_state = + GetPgFdwScanState(next_conn_owner); + + request_more_data(next_conn_owner); + + /* Register the node to the async-waiting node list */ + if (!next_owner_state->async_waiting) + next_owner_state->async_waiting = true; + + if (!next_conn_owner->fs_async) + fetch_received_data(next_conn_owner); + } + + + /* + * If we haven't received a result for the given node this time, + * return with no tuple to give way to other nodes. + */ if (fsstate->next_tuple >= fsstate->num_tuples) + { + fsstate->result_ready = fsstate->eof_reached; + node->ss.ps.asyncstate = + fsstate->result_ready ? AS_AVAILABLE : AS_WAITING; return ExecClearTuple(slot); + } } /* * Return the next tuple. */ + fsstate->result_ready = true; + node->ss.ps.asyncstate = AS_AVAILABLE; ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++], slot, InvalidBuffer, @@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node) static void postgresReScanForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); char sql[64]; PGresult *res; @@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node) if (!fsstate->cursor_exists) return; + /* Absorb the remaining result */ + absorb_current_result(node); + /* * If any internal parameters affecting this node have changed, we'd * better destroy and recreate the cursor. Otherwise, rewinding it should @@ -1460,9 +1620,9 @@ * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult.
*/ - res = pgfdw_exec_query(fsstate->conn, sql); + res = pgfdw_exec_query(fsstate->s.conn, sql); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fsstate->conn, true, sql); + pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql); PQclear(res); /* Now force a fresh FETCH. */ @@ -1480,7 +1640,7 @@ static void postgresEndForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); /* if fsstate is NULL, we are in EXPLAIN; nothing to do */ if (fsstate == NULL) @@ -1488,16 +1648,32 @@ /* Close the cursor if open, to prevent accumulation of cursors */ if (fsstate->cursor_exists) - close_cursor(fsstate->conn, fsstate->cursor_number); + close_cursor(fsstate->s.conn, fsstate->cursor_number); /* Release remote connection */ - ReleaseConnection(fsstate->conn); - fsstate->conn = NULL; + ReleaseConnection(fsstate->s.conn); + fsstate->s.conn = NULL; /* MemoryContexts will be deleted automatically. */ } /* + * postgresShutdownForeignScan + * Remove asynchrony stuff and clean up garbage on the connection. + */ +static void +postgresShutdownForeignScan(ForeignScanState *node) +{ + ForeignScan *plan = (ForeignScan *) node->ss.ps.plan; + + if (plan->operation != CMD_SELECT) + return; + + /* Absorb the remaining result */ + absorb_current_result(node); +} + +/* * postgresAddForeignUpdateTargets * Add resjunk column(s) needed for update/delete on a foreign table */ @@ -1700,7 +1876,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate, user = GetUserMapping(userid, table->serverid); /* Open connection; report that we'll create a prepared statement. */ - fmstate->conn = GetConnection(user, true); + fmstate->s.conn = GetConnection(user, true); + fmstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); fmstate->p_name = NULL; /* prepared statement not made yet */ /* Deconstruct fdw_private data. */ @@ -1779,6 +1957,8 @@ postgresExecForeignInsert(EState *estate, PGresult *res; int n_rows; + vacate_connection((PgFdwState *)fmstate); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1789,14 +1969,14 @@ /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -1804,10 +1984,10 @@ * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ?
PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -1845,6 +2025,8 @@ postgresExecForeignUpdate(EState *estate, PGresult *res; int n_rows; + vacate_connection((PgFdwState *)fmstate); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1865,14 +2047,14 @@ postgresExecForeignUpdate(EState *estate, /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -1880,10 +2062,10 @@ postgresExecForeignUpdate(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -1921,6 +2103,8 @@ postgresExecForeignDelete(EState *estate, PGresult *res; int n_rows; + vacate_connection((PgFdwState *)fmstate); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1941,14 +2125,14 @@ postgresExecForeignDelete(EState *estate, /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -1956,10 +2140,10 @@ postgresExecForeignDelete(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -2006,16 +2190,16 @@ postgresEndForeignModify(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. 
*/ - res = pgfdw_exec_query(fmstate->conn, sql); + res = pgfdw_exec_query(fmstate->s.conn, sql); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fmstate->conn, true, sql); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql); PQclear(res); fmstate->p_name = NULL; } /* Release remote connection */ - ReleaseConnection(fmstate->conn); - fmstate->conn = NULL; + ReleaseConnection(fmstate->s.conn); + fmstate->s.conn = NULL; } /* @@ -2303,7 +2487,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags) * Get connection to the foreign server. Connection manager will * establish new connection if necessary. */ - dmstate->conn = GetConnection(user, false); + dmstate->s.conn = GetConnection(user, false); + dmstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); /* Initialize state variable */ dmstate->num_tuples = -1; /* -1 means not set yet */ @@ -2356,7 +2542,10 @@ postgresIterateDirectModify(ForeignScanState *node) * If this is the first call after Begin, execute the statement. */ if (dmstate->num_tuples == -1) + { + vacate_connection((PgFdwState *)dmstate); execute_dml_stmt(node); + } /* * If the local query doesn't specify RETURNING, just clear tuple slot. @@ -2403,8 +2592,8 @@ postgresEndDirectModify(ForeignScanState *node) PQclear(dmstate->result); /* Release remote connection */ - ReleaseConnection(dmstate->conn); - dmstate->conn = NULL; + ReleaseConnection(dmstate->s.conn); + dmstate->s.conn = NULL; /* MemoryContext will be deleted automatically. */ } @@ -2523,6 +2712,7 @@ estimate_path_cost_size(PlannerInfo *root, List *local_param_join_conds; StringInfoData sql; PGconn *conn; + PgFdwConnpriv *connpriv; Selectivity local_sel; QualCost local_cost; List *fdw_scan_tlist = NIL; @@ -2565,6 +2755,16 @@ estimate_path_cost_size(PlannerInfo *root, /* Get the remote estimate */ conn = GetConnection(fpinfo->user, false); + connpriv = GetConnectionSpecificStorage(fpinfo->user, + sizeof(PgFdwConnpriv)); + if (connpriv) + { + PgFdwState tmpstate; + tmpstate.conn = conn; + tmpstate.connpriv = connpriv; + vacate_connection(&tmpstate); + } + get_remote_estimate(sql.data, conn, &rows, &width, &startup_cost, &total_cost); ReleaseConnection(conn); @@ -2919,11 +3119,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel, static void create_cursor(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); ExprContext *econtext = node->ss.ps.ps_ExprContext; int numParams = fsstate->numParams; const char **values = fsstate->param_values; - PGconn *conn = fsstate->conn; + PGconn *conn = fsstate->s.conn; StringInfoData buf; PGresult *res; @@ -2989,47 +3189,96 @@ create_cursor(ForeignScanState *node) * Fetch some more rows from the node's cursor. */ static void -fetch_more_data(ForeignScanState *node) +request_more_data(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); + PGconn *conn = fsstate->s.conn; + char sql[64]; + + /* The connection should be vacant */ + Assert(fsstate->s.connpriv->current_owner == NULL); + + /* + * If this is the first call after Begin or ReScan, we need to create the + * cursor on the remote side. 
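+ * Note that create_cursor() runs the DECLARE CURSOR synchronously; only
+ * the FETCH sent just below is left in flight, with this node recorded
+ * as the connection's current owner.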
+ */ + if (!fsstate->cursor_exists) + create_cursor(node); + + snprintf(sql, sizeof(sql), "FETCH %d FROM c%u", + fsstate->fetch_size, fsstate->cursor_number); + + if (!PQsendQuery(conn, sql)) + pgfdw_report_error(ERROR, NULL, conn, false, sql); + + fsstate->s.connpriv->current_owner = node; +} + +/* + * Fetch some more rows from the node's cursor. + */ +static void +fetch_received_data(ForeignScanState *node) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); PGresult *volatile res = NULL; MemoryContext oldcontext; + /* I should be the current connection owner */ + Assert(fsstate->s.connpriv->current_owner == node); + /* * We'll store the tuples in the batch_cxt. First, flush the previous - * batch. + * batch if no tuple is remaining */ - fsstate->tuples = NULL; - MemoryContextReset(fsstate->batch_cxt); + if (fsstate->next_tuple >= fsstate->num_tuples) + { + fsstate->tuples = NULL; + fsstate->num_tuples = 0; + MemoryContextReset(fsstate->batch_cxt); + } + else if (fsstate->next_tuple > 0) + { + /* move the remaining tuples to the beginning of the store */ + int n = 0; + + while(fsstate->next_tuple < fsstate->num_tuples) + fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++]; + fsstate->num_tuples = n; + } + oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt); /* PGresult must be released before leaving this function. */ PG_TRY(); { - PGconn *conn = fsstate->conn; + PGconn *conn = fsstate->s.conn; char sql[64]; - int numrows; + int addrows; + size_t newsize; int i; snprintf(sql, sizeof(sql), "FETCH %d FROM c%u", fsstate->fetch_size, fsstate->cursor_number); - res = pgfdw_exec_query(conn, sql); + res = pgfdw_get_result(conn, sql); /* On error, report the original query, not the FETCH. */ if (PQresultStatus(res) != PGRES_TUPLES_OK) pgfdw_report_error(ERROR, res, conn, false, fsstate->query); /* Convert the data into HeapTuples */ - numrows = PQntuples(res); - fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple)); - fsstate->num_tuples = numrows; - fsstate->next_tuple = 0; + addrows = PQntuples(res); + newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple); + if (fsstate->tuples) + fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize); + else + fsstate->tuples = (HeapTuple *) palloc(newsize); - for (i = 0; i < numrows; i++) + for (i = 0; i < addrows; i++) { Assert(IsA(node->ss.ps.plan, ForeignScan)); - fsstate->tuples[i] = + fsstate->tuples[fsstate->num_tuples + i] = make_tuple_from_result_row(res, i, fsstate->rel, fsstate->attinmeta, @@ -3039,27 +3288,82 @@ fetch_more_data(ForeignScanState *node) } /* Update fetch_ct_2 */ - if (fsstate->fetch_ct_2 < 2) + if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0) fsstate->fetch_ct_2++; + fsstate->next_tuple = 0; + fsstate->num_tuples += addrows; + /* Must be EOF if we didn't get as many tuples as we asked for. 
*/ - fsstate->eof_reached = (numrows < fsstate->fetch_size); + fsstate->eof_reached = (addrows < fsstate->fetch_size); PQclear(res); res = NULL; } PG_CATCH(); { + fsstate->s.connpriv->current_owner = NULL; if (res) PQclear(res); PG_RE_THROW(); } PG_END_TRY(); + fsstate->s.connpriv->current_owner = NULL; + MemoryContextSwitchTo(oldcontext); } /* + * Vacate a connection so that this node can send the next query + */ +static void +vacate_connection(PgFdwState *fdwstate) +{ + PgFdwConnpriv *connpriv = fdwstate->connpriv; + ForeignScanState *owner; + + if (connpriv == NULL || connpriv->current_owner == NULL) + return; + + /* + * let the current connection owner read the result for the running query + */ + owner = connpriv->current_owner; + fetch_received_data(owner); + + /* Clear the waiting list */ + while (owner) + { + PgFdwScanState *fsstate = GetPgFdwScanState(owner); + + fsstate->last_waiter = NULL; + owner = fsstate->waiter; + fsstate->waiter = NULL; + } +} + +/* + * Absorb the result of the current query. + */ +static void +absorb_current_result(ForeignScanState *node) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); + ForeignScanState *owner = fsstate->s.connpriv->current_owner; + + if (owner) + { + PgFdwScanState *target_state = GetPgFdwScanState(owner); + PGconn *conn = target_state->s.conn; + + while(PQisBusy(conn)) + PQclear(PQgetResult(conn)); + fsstate->s.connpriv->current_owner = NULL; + fsstate->async_waiting = false; + } +} +/* * Force assorted GUC parameters to settings that ensure that we'll output * data values in a form that is unambiguous to the remote server. * @@ -3143,7 +3447,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) /* Construct name we'll use for the prepared statement. */ snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u", - GetPrepStmtNumber(fmstate->conn)); + GetPrepStmtNumber(fmstate->s.conn)); p_name = pstrdup(prep_name); /* @@ -3153,12 +3457,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) * the prepared statements we use in this module are simple enough that * the remote server will make the right choices. */ - if (!PQsendPrepare(fmstate->conn, + if (!PQsendPrepare(fmstate->s.conn, p_name, fmstate->query, 0, NULL)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -3166,9 +3470,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); PQclear(res); /* This action shows that the prepare has been done. */ @@ -3299,9 +3603,9 @@ execute_dml_stmt(ForeignScanState *node) * the desired result. This allows us to avoid assuming that the remote * server has the same OIDs we do for the parameters' types. */ - if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams, + if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams, NULL, values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query); + pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query); /* * Get the result, and check for success. 
@@ -3309,10 +3613,10 @@ * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query); + dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query); if (PQresultStatus(dmstate->result) != (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true, + pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true, dmstate->query); /* Get the number of rows affected. */ @@ -4582,6 +4886,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root, /* XXX Consider parameterized paths for the join relation */ } +static bool +postgresIsForeignPathAsyncCapable(ForeignPath *path) +{ + return true; +} + + +/* + * Configure a wait event. + * + * Add a wait event only when this node is the connection owner. Otherwise + * another node on this connection is the owner. + */ +static bool +postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes, + void *caller_data, bool reinit) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); + + + /* If the caller didn't reinit, this event is already in the event set */ + if (!reinit) + return true; + + if (fsstate->s.connpriv->current_owner == node) + { + AddWaitEventToSet(wes, + WL_SOCKET_READABLE, PQsocket(fsstate->s.conn), + NULL, caller_data); + return true; + } + + return false; +} + + /* * Assess whether the aggregation, grouping and having operations can be pushed * down to the foreign server. As a side effect, save information we obtain in @@ -4946,7 +5286,7 @@ make_tuple_from_result_row(PGresult *res, PgFdwScanState *fdw_sstate; Assert(fsstate); - fdw_sstate = (PgFdwScanState *) fsstate->fdw_state; + fdw_sstate = GetPgFdwScanState(fsstate); tupdesc = fdw_sstate->tupdesc; } diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h index 788b003..41ac1d2 100644 --- a/contrib/postgres_fdw/postgres_fdw.h +++ b/contrib/postgres_fdw/postgres_fdw.h @@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo UserMapping *user; /* only set in use_remote_estimate mode */ int fetch_size; /* fetch size for this remote table */ + bool allow_prefetch; /* true to allow overlapped fetching */ /* * Name of the relation while EXPLAINing ForeignScan.
It is used for join @@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel); /* in connection.c */ extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt); +void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize); extern void ReleaseConnection(PGconn *conn); extern unsigned int GetCursorNumber(PGconn *conn); extern unsigned int GetPrepStmtNumber(PGconn *conn); diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql index 3c3c5c7..cb9caa5 100644 --- a/contrib/postgres_fdw/sql/postgres_fdw.sql +++ b/contrib/postgres_fdw/sql/postgres_fdw.sql @@ -1535,25 +1535,25 @@ INSERT INTO b(aa) VALUES('bbb'); INSERT INTO b(aa) VALUES('bbbb'); INSERT INTO b(aa) VALUES('bbbbb'); -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; UPDATE b SET aa = 'new'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; UPDATE a SET aa = 'newtoo'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; @@ -1589,12 +1589,12 @@ insert into bar2 values(4,44,44); insert into bar2 values(7,77,77); explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for update; -select * from bar where f1 in (select f1 from foo) for update; +select * from bar where f1 in (select f1 from foo) order by 1 for update; +select * from bar where f1 in (select f1 from foo) order by 1 for update; explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for share; -select * from bar where f1 in (select f1 from foo) for share; +select * from bar where f1 in (select f1 from foo) order by 1 for share; +select * from bar where f1 in (select f1 from foo) order by 1 for share; -- Check UPDATE with inherited target and an inherited source table explain (verbose, costs off) @@ -1653,8 +1653,8 @@ explain (verbose, costs off) delete from foo where f1 < 5 returning *; delete from foo where f1 < 5 returning *; explain (verbose, costs off) -update bar set f2 = f2 + 100 returning *; -update bar set f2 = f2 + 100 returning *; +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; -- Test that UPDATE/DELETE with inherited target works with row-level triggers CREATE TRIGGER trig_row_before -- 2.9.2
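To make the intended use concrete, here is a minimal sketch of the kind of sharded setup these patches target. The server names, hosts and table names are illustrative only, not taken from the patch or its regression tests:

CREATE EXTENSION postgres_fdw;

CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'node1', dbname 'db');
CREATE SERVER shard2 FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'node2', dbname 'db');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1;
CREATE USER MAPPING FOR CURRENT_USER SERVER shard2;

-- one foreign table per shard
CREATE FOREIGN TABLE pt1 (id int, v text) SERVER shard1
  OPTIONS (table_name 'loct');
CREATE FOREIGN TABLE pt2 (id int, v text) SERVER shard2
  OPTIONS (table_name 'loct');

-- The Append over the two foreign scans is what the async machinery
-- accelerates: each child runs on its own connection, so both remote
-- queries can be in flight at once instead of being driven serially.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM pt1 UNION ALL SELECT * FROM pt2;

This corresponds to the "sharding (multi con)" case in the measurements below; when a single foreign server holds all the shards, the children share one connection and take turns on it via the waiter list described above.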
At Mon, 11 Dec 2017 20:07:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20171211.200753.191768178.horiguchi.kyotaro@lab.ntt.co.jp>
> > The attached PoC patch theoretically has no impact on the normal
> > code paths and just brings gain in async cases.
>
> The parallel append just committed hit this and the attached is the
> rebased version to the current HEAD. The result of a concise
> performance test follows.
>
>                            patched(ms)  unpatched(ms)  gain(%)
> A: simple table scan     :  3562.32      3444.81        -3.4
> B: local partitioning    :  1451.25      1604.38         9.5
> C: single remote table   :  8818.92      9297.76         5.1
> D: sharding (single con) :  5966.14      6646.73        10.2
> E: sharding (multi con)  :  1802.25      6515.49        72.3
>
> > A and B are degradation checks, which are expected to show no
> > degradation. C is the gain only by postgres_fdw's command
> > presending on a remote table. D is the gain of sharding on a
> > connection. The number of partitions/shards is 4. E is the gain
> > using a dedicated connection per shard.
>
> Test A is accelerated by parallel sequential scan. Introducing
> parallel append accelerates test B. Comparing A and B, I doubt that
> the degradation is stably measurable, at least in my environment, but
> I believe that there's no degradation theoretically. Tests C to E
> still show an apparent gain.
> regards,

The patch conflicts with 3cac0ec. This is the rebased version.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From be22b33b90abec93a2a609a1db4955e6910b2da0 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 22 May 2017 12:42:58 +0900 Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner WaitEventSet needs to be released using a resource owner in certain cases. This change adds a resource owner to WaitEventSet and allows the creator of a WaitEventSet to specify one. --- src/backend/libpq/pqcomm.c | 2 +- src/backend/storage/ipc/latch.c | 18 ++++++- src/backend/storage/lmgr/condition_variable.c | 2 +- src/backend/utils/resowner/resowner.c | 68 +++++++++++++++++++++++++++ src/include/storage/latch.h | 4 +- src/include/utils/resowner_private.h | 8 ++++ 6 files changed, 97 insertions(+), 5 deletions(-) diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index a4f6d4d..890972b 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -220,7 +220,7 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3); + FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3); AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL); diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index e6706f7..5457899 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -51,6 +51,7 @@ #include "storage/latch.h" #include "storage/pmsignal.h" #include "storage/shmem.h" +#include "utils/resowner_private.h" /* * Select the fd readiness primitive to use. Normally the "most modern" @@ -77,6 +78,8 @@ struct WaitEventSet int nevents; /* number of registered events */ int nevents_space; /* maximum number of events in this set */ + ResourceOwner resowner; /* Resource owner */ + /* * Array, of nevents_space length, storing the definition of events this * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, int ret = 0; int rc; WaitEvent event; - WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3); + WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3); if (wakeEvents & WL_TIMEOUT) Assert(timeout >= 0); @@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch) * WaitEventSetWait(). */ WaitEventSet * -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) { WaitEventSet *set; char *data; Size sz = 0; + if (res) + ResourceOwnerEnlargeWESs(res); + /* * Use MAXALIGN size/alignment to guarantee that later uses of memory are * aligned correctly. E.g. epoll_event might need 8 byte alignment on some @@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents) StaticAssertStmt(WSA_INVALID_EVENT == NULL, ""); #endif + /* Register this wait event set if requested */ + set->resowner = res; + if (res) + ResourceOwnerRememberWES(set->resowner, set); + return set; } @@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set) } #endif + if (set->resowner != NULL) + ResourceOwnerForgetWES(set->resowner, set); + pfree(set); } diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c index ef1d5ba..30edc8e 100644 --- a/src/backend/storage/lmgr/condition_variable.c +++ b/src/backend/storage/lmgr/condition_variable.c @@ -69,7 +69,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv) { WaitEventSet *new_event_set; - new_event_set = CreateWaitEventSet(TopMemoryContext, 2); + new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2); AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL); AddWaitEventToSet(new_event_set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c index e09a4f1..7ae8777 100644 --- a/src/backend/utils/resowner/resowner.c +++ b/src/backend/utils/resowner/resowner.c @@ -124,6 +124,7 @@ typedef struct ResourceOwnerData ResourceArray snapshotarr; /* snapshot references */ ResourceArray filearr; /* open temporary files */ ResourceArray dsmarr; /* dynamic shmem segments */ + ResourceArray wesarr; /* wait event sets */ /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. 
*/ int nlocks; /* number of owned locks */ @@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc); static void PrintSnapshotLeakWarning(Snapshot snapshot); static void PrintFileLeakWarning(File file); static void PrintDSMLeakWarning(dsm_segment *seg); +static void PrintWESLeakWarning(WaitEventSet *events); /***************************************************************************** @@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name) ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL)); ResourceArrayInit(&(owner->filearr), FileGetDatum(-1)); ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL)); + ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL)); return owner; } @@ -538,6 +541,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner, PrintDSMLeakWarning(res); dsm_detach(res); } + + /* Ditto for wait event sets */ + while (ResourceArrayGetAny(&(owner->wesarr), &foundres)) + { + WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres); + + if (isCommit) + PrintWESLeakWarning(event); + FreeWaitEventSet(event); + } } else if (phase == RESOURCE_RELEASE_LOCKS) { @@ -685,6 +698,7 @@ ResourceOwnerDelete(ResourceOwner owner) Assert(owner->snapshotarr.nitems == 0); Assert(owner->filearr.nitems == 0); Assert(owner->dsmarr.nitems == 0); + Assert(owner->wesarr.nitems == 0); Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1); /* @@ -711,6 +725,7 @@ ResourceOwnerDelete(ResourceOwner owner) ResourceArrayFree(&(owner->snapshotarr)); ResourceArrayFree(&(owner->filearr)); ResourceArrayFree(&(owner->dsmarr)); + ResourceArrayFree(&(owner->wesarr)); pfree(owner); } @@ -1253,3 +1268,56 @@ PrintDSMLeakWarning(dsm_segment *seg) elog(WARNING, "dynamic shared memory leak: segment %u still referenced", dsm_segment_handle(seg)); } + +/* + * Make sure there is room for at least one more entry in a ResourceOwner's + * wait event set reference array. + * + * This is separate from actually inserting an entry because if we run out + * of memory, it's critical to do so *before* acquiring the resource. + */ +void +ResourceOwnerEnlargeWESs(ResourceOwner owner) +{ + ResourceArrayEnlarge(&(owner->wesarr)); +} + +/* + * Remember that a wait event set is owned by a ResourceOwner + * + * Caller must have previously done ResourceOwnerEnlargeWESs() + */ +void +ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events) +{ + ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events)); +} + +/* + * Forget that a wait event set is owned by a ResourceOwner + */ +void +ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events) +{ + /* + * XXXX: There's no property to show as an identifier of a wait event set, + * use its pointer instead. + */ + if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events))) + elog(ERROR, "wait event set %p is not owned by resource owner %s", + events, owner->name); +} + +/* + * Debugging subroutine + */ +static void +PrintWESLeakWarning(WaitEventSet *events) +{ + /* + * XXXX: There's no property to show as an identifier of a wait event set, + * use its pointer instead.
+ */ + elog(WARNING, "wait event set leak: %p still referenced", + events); +} diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h index a4bcb48..838845a 100644 --- a/src/include/storage/latch.h +++ b/src/include/storage/latch.h @@ -101,6 +101,7 @@ #define LATCH_H #include <signal.h> +#include "utils/resowner.h" /* * Latch structure should be treated as opaque and only accessed through @@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch); extern void SetLatch(volatile Latch *latch); extern void ResetLatch(volatile Latch *latch); -extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents); +extern WaitEventSet *CreateWaitEventSet(MemoryContext context, + ResourceOwner res, int nevents); extern void FreeWaitEventSet(WaitEventSet *set); extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch, void *user_data); diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h index 22b377c..56f2059 100644 --- a/src/include/utils/resowner_private.h +++ b/src/include/utils/resowner_private.h @@ -18,6 +18,7 @@ #include "storage/dsm.h" #include "storage/fd.h" +#include "storage/latch.h" #include "storage/lock.h" #include "utils/catcache.h" #include "utils/plancache.h" @@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner, extern void ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *); +/* support for wait event set management */ +extern void ResourceOwnerEnlargeWESs(ResourceOwner owner); +extern void ResourceOwnerRememberWES(ResourceOwner owner, + WaitEventSet *); +extern void ResourceOwnerForgetWES(ResourceOwner owner, + WaitEventSet *); + #endif /* RESOWNER_PRIVATE_H */ -- 2.9.2 From 885f62d89a93edbda44330c3ecc3a7ac08e302ea Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 19 Oct 2017 17:23:51 +0900 Subject: [PATCH 2/3] core side modification --- src/backend/executor/Makefile | 2 +- src/backend/executor/execAsync.c | 110 ++++++++++++++ src/backend/executor/nodeAppend.c | 247 +++++++++++++++++++++++++++----- src/backend/executor/nodeForeignscan.c | 22 ++- src/backend/optimizer/plan/createplan.c | 62 +++++++- src/backend/postmaster/pgstat.c | 3 + src/include/executor/execAsync.h | 23 +++ src/include/executor/executor.h | 1 + src/include/executor/nodeForeignscan.h | 3 + src/include/foreign/fdwapi.h | 11 ++ src/include/nodes/execnodes.h | 18 ++- src/include/nodes/plannodes.h | 2 + src/include/pgstat.h | 3 +- 13 files changed, 462 insertions(+), 45 deletions(-) create mode 100644 src/backend/executor/execAsync.c create mode 100644 src/include/executor/execAsync.h diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile index cc09895..8ad2adf 100644 --- a/src/backend/executor/Makefile +++ b/src/backend/executor/Makefile @@ -12,7 +12,7 @@ subdir = src/backend/executor top_builddir = ../../.. 
include $(top_builddir)/src/Makefile.global -OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \ +OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \ execGrouping.o execIndexing.o execJunk.o \ execMain.o execParallel.o execPartition.o execProcnode.o \ execReplication.o execScan.o execSRF.o execTuples.o \ diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c new file mode 100644 index 0000000..f7daed7 --- /dev/null +++ b/src/backend/executor/execAsync.c @@ -0,0 +1,110 @@ +/*------------------------------------------------------------------------- + * + * execAsync.c + * Support routines for asynchronous execution. + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/executor/execAsync.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "executor/execAsync.h" +#include "executor/nodeAppend.h" +#include "executor/nodeForeignscan.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "utils/memutils.h" +#include "utils/resowner.h" + +void ExecAsyncSetState(PlanState *pstate, AsyncState status) +{ + pstate->asyncstate = status; +} + +bool +ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node, + void *data, bool reinit) +{ + switch (nodeTag(node)) + { + case T_ForeignScanState: + return ExecForeignAsyncConfigureWait((ForeignScanState *)node, + wes, data, reinit); + break; + default: + elog(ERROR, "unrecognized node type: %d", + (int) nodeTag(node)); + } +} + +#define EVENT_BUFFER_SIZE 16 + +Bitmapset * +ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout) +{ + static int *refind = NULL; + static int refindsize = 0; + WaitEventSet *wes; + WaitEvent occurred_event[EVENT_BUFFER_SIZE]; + int noccurred = 0; + Bitmapset *fired_events = NULL; + int i; + int n; + + n = bms_num_members(waitnodes); + wes = CreateWaitEventSet(TopTransactionContext, + TopTransactionResourceOwner, n); + if (refindsize < n) + { + if (refindsize == 0) + refindsize = EVENT_BUFFER_SIZE; /* XXX */ + while (refindsize < n) + refindsize *= 2; + if (refind) + refind = (int *) repalloc(refind, refindsize * sizeof(int)); + else + refind = (int *) palloc(refindsize * sizeof(int)); + } + + n = 0; + for (i = bms_next_member(waitnodes, -1) ; i >= 0 ; + i = bms_next_member(waitnodes, i)) + { + refind[i] = i; + if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true)) + n++; + } + + if (n == 0) + { + FreeWaitEventSet(wes); + return NULL; + } + + noccurred = WaitEventSetWait(wes, timeout, occurred_event, + EVENT_BUFFER_SIZE, + WAIT_EVENT_ASYNC_WAIT); + FreeWaitEventSet(wes); + if (noccurred == 0) + return NULL; + + for (i = 0 ; i < noccurred ; i++) + { + WaitEvent *w = &occurred_event[i]; + + if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0) + { + int n = *(int*)w->user_data; + + fired_events = bms_add_member(fired_events, n); + } + } + + return fired_events; +} diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c index 64a17fb..644af5b 100644 --- a/src/backend/executor/nodeAppend.c +++ b/src/backend/executor/nodeAppend.c @@ -59,6 +59,7 @@ #include "executor/execdebug.h" #include "executor/nodeAppend.h" +#include "executor/execAsync.h" #include "miscadmin.h" /* Shared state for parallel-aware Append. 
*/ @@ -79,6 +80,7 @@ struct ParallelAppendState #define INVALID_SUBPLAN_INDEX -1 static TupleTableSlot *ExecAppend(PlanState *pstate); +static TupleTableSlot *ExecAppendAsync(PlanState *pstate); static bool choose_next_subplan_locally(AppendState *node); static bool choose_next_subplan_for_leader(AppendState *node); static bool choose_next_subplan_for_worker(AppendState *node); @@ -104,7 +106,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags) ListCell *lc; /* check for unsupported flags */ - Assert(!(eflags & EXEC_FLAG_MARK)); + Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC))); /* * Lock the non-leaf tables in the partition tree controlled by this node. @@ -127,6 +129,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags) appendstate->ps.ExecProcNode = ExecAppend; appendstate->appendplans = appendplanstates; appendstate->as_nplans = nplans; + appendstate->as_nasyncplans = node->nasyncplans; + appendstate->as_syncdone = (node->nasyncplans == nplans); + appendstate->as_asyncresult = (TupleTableSlot **) + palloc0(node->nasyncplans * sizeof(TupleTableSlot *)); + + /* initially, all async subplans need a request */ + for (i = 0; i < appendstate->as_nasyncplans; ++i) + appendstate->as_needrequest = + bms_add_member(appendstate->as_needrequest, i); /* * Miscellaneous initialization @@ -149,27 +164,48 @@ ExecInitAppend(Append *node, EState *estate, int eflags) foreach(lc, node->appendplans) { Plan *initNode = (Plan *) lfirst(lc); + int sub_eflags = eflags; + + if (i < appendstate->as_nasyncplans) + sub_eflags |= EXEC_FLAG_ASYNC; - appendplanstates[i] = ExecInitNode(initNode, estate, eflags); + appendplanstates[i] = ExecInitNode(initNode, estate, sub_eflags); i++; } + /* if there's any async-capable subnode, use the async-aware routine */ + if (appendstate->as_nasyncplans > 0) + appendstate->ps.ExecProcNode = ExecAppendAsync; + /* * initialize output tuple type */ ExecAssignResultTypeFromTL(&appendstate->ps); appendstate->ps.ps_ProjInfo = NULL; - /* - * Parallel-aware append plans must choose the first subplan to execute by - * looking at shared memory, but non-parallel-aware append plans can - * always start with the first subplan. - */ - appendstate->as_whichplan = - appendstate->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0; + if (appendstate->ps.plan->parallel_aware) + { + /* + * Parallel-aware append plans must choose the first subplan to + * execute by looking at shared memory, but non-parallel-aware append + * plans can always start with the first subplan. + */ - /* If parallel-aware, this will be overridden later. */ - appendstate->choose_next_subplan = choose_next_subplan_locally; + appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX; + + /* If parallel-aware, this will be overridden later. */ + appendstate->choose_next_subplan = choose_next_subplan_locally; + } + else + { + /* + * initialize to scan the first synchronous subplan + */ + appendstate->as_whichsyncplan = appendstate->as_nasyncplans; + appendstate->choose_next_subplan = choose_next_subplan_locally; + } return appendstate; } @@ -186,10 +222,12 @@ ExecAppend(PlanState *pstate) AppendState *node = castNode(AppendState, pstate); /* If no subplan has been chosen, we must choose one before proceeding. 
*/ - if (node->as_whichplan == INVALID_SUBPLAN_INDEX && + if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX && !node->choose_next_subplan(node)) return ExecClearTuple(node->ps.ps_ResultTupleSlot); + Assert(node->as_nasyncplans == 0); + for (;;) { PlanState *subnode; @@ -200,8 +238,9 @@ ExecAppend(PlanState *pstate) /* * figure out which subplan we are currently processing */ - Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans); - subnode = node->appendplans[node->as_whichplan]; + Assert(node->as_whichsyncplan >= 0 && + node->as_whichsyncplan < node->as_nplans); + subnode = node->appendplans[node->as_whichsyncplan]; /* * get a tuple from the subplan @@ -224,6 +263,137 @@ ExecAppend(PlanState *pstate) } } +static TupleTableSlot * +ExecAppendAsync(PlanState *pstate) +{ + AppendState *node = castNode(AppendState, pstate); + Bitmapset *needrequest; + int i; + + Assert(node->as_nasyncplans > 0); + + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + needrequest = node->as_needrequest; + node->as_needrequest = NULL; + while ((i = bms_first_member(needrequest)) >= 0) + { + TupleTableSlot *slot; + PlanState *subnode = node->appendplans[i]; + + slot = ExecProcNode(subnode); + if (subnode->asyncstate == AS_AVAILABLE) + { + if (!TupIsNull(slot)) + { + node->as_asyncresult[node->as_nasyncresult++] = slot; + node->as_needrequest = bms_add_member(node->as_needrequest, i); + } + } + else + node->as_pending_async = bms_add_member(node->as_pending_async, i); + } + bms_free(needrequest); + + for (;;) + { + TupleTableSlot *result; + + /* return now if a result is available */ + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + while (!bms_is_empty(node->as_pending_async)) + { + long timeout = node->as_syncdone ? -1 : 0; + Bitmapset *fired; + int i; + + fired = ExecAsyncEventWait(node->appendplans, node->as_pending_async, + timeout); + while ((i = bms_first_member(fired)) >= 0) + { + TupleTableSlot *slot; + PlanState *subnode = node->appendplans[i]; + slot = ExecProcNode(subnode); + if (subnode->asyncstate == AS_AVAILABLE) + { + if (!TupIsNull(slot)) + { + node->as_asyncresult[node->as_nasyncresult++] = slot; + node->as_needrequest = + bms_add_member(node->as_needrequest, i); + } + node->as_pending_async = + bms_del_member(node->as_pending_async, i); + } + } + bms_free(fired); + + /* return now if a result is available */ + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + if (!node->as_syncdone) + break; + } + + /* + * If there is no asynchronous activity still pending and the + * synchronous activity is also complete, we're totally done scanning + * this node. Otherwise, we're done with the asynchronous stuff but + * must continue scanning the synchronous children. + */ + if (node->as_syncdone) + { + Assert(bms_is_empty(node->as_pending_async)); + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + + /* + * get a tuple from the subplan + */ + result = ExecProcNode(node->appendplans[node->as_whichsyncplan]); + + if (!TupIsNull(result)) + { + /* + * If the subplan gave us something then return it as-is. We do + * NOT make use of the result slot that was set up in + * ExecInitAppend; there's no need for it. + */ + return result; + } + + /* + * Go on to the "next" subplan in the appropriate direction. 
If no + * more subplans, return the empty slot set up for us by + * ExecInitAppend, unless there are async plans we have yet to finish. + */ + if (!node->choose_next_subplan(node)) + { + node->as_syncdone = true; + if (bms_is_empty(node->as_pending_async)) + { + Assert(bms_is_empty(node->as_needrequest)); + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + } + + /* Else loop back and try to get a tuple from the new subplan */ + } +} + /* ---------------------------------------------------------------- * ExecEndAppend * @@ -257,6 +427,15 @@ ExecReScanAppend(AppendState *node) { int i; + /* Reset async state. */ + for (i = 0; i < node->as_nasyncplans; ++i) + { + ExecShutdownNode(node->appendplans[i]); + node->as_needrequest = bms_add_member(node->as_needrequest, i); + } + node->as_nasyncresult = 0; + node->as_syncdone = (node->as_nasyncplans == node->as_nplans); + for (i = 0; i < node->as_nplans; i++) { PlanState *subnode = node->appendplans[i]; @@ -276,7 +455,7 @@ ExecReScanAppend(AppendState *node) ExecReScan(subnode); } - node->as_whichplan = + node->as_whichsyncplan = node->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0; } @@ -365,7 +544,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt) static bool choose_next_subplan_locally(AppendState *node) { - int whichplan = node->as_whichplan; + int whichplan = node->as_whichsyncplan; /* We should never see INVALID_SUBPLAN_INDEX in this case. */ Assert(whichplan >= 0 && whichplan <= node->as_nplans); @@ -374,13 +553,13 @@ choose_next_subplan_locally(AppendState *node) { if (whichplan >= node->as_nplans - 1) return false; - node->as_whichplan++; + node->as_whichsyncplan++; } else { if (whichplan <= 0) return false; - node->as_whichplan--; + node->as_whichsyncplan--; } return true; @@ -405,33 +584,33 @@ choose_next_subplan_for_leader(AppendState *node) LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE); - if (node->as_whichplan != INVALID_SUBPLAN_INDEX) + if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX) { /* Mark just-completed subplan as finished. */ - node->as_pstate->pa_finished[node->as_whichplan] = true; + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; } else { /* Start with last subplan. */ - node->as_whichplan = node->as_nplans - 1; + node->as_whichsyncplan = node->as_nplans - 1; } /* Loop until we find a subplan to execute. */ - while (pstate->pa_finished[node->as_whichplan]) + while (pstate->pa_finished[node->as_whichsyncplan]) { - if (node->as_whichplan == 0) + if (node->as_whichsyncplan == 0) { pstate->pa_next_plan = INVALID_SUBPLAN_INDEX; - node->as_whichplan = INVALID_SUBPLAN_INDEX; + node->as_whichsyncplan = INVALID_SUBPLAN_INDEX; LWLockRelease(&pstate->pa_lock); return false; } - node->as_whichplan--; + node->as_whichsyncplan--; } /* If non-partial, immediately mark as finished. */ - if (node->as_whichplan < append->first_partial_plan) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan < append->first_partial_plan) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; LWLockRelease(&pstate->pa_lock); @@ -463,8 +642,8 @@ choose_next_subplan_for_worker(AppendState *node) LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE); /* Mark just-completed subplan as finished. 
*/ - if (node->as_whichplan != INVALID_SUBPLAN_INDEX) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; /* If all the plans are already done, we have nothing to do */ if (pstate->pa_next_plan == INVALID_SUBPLAN_INDEX) @@ -489,10 +668,10 @@ else { /* At last plan, no partial plans, arrange to bail out. */ - pstate->pa_next_plan = node->as_whichplan; + pstate->pa_next_plan = node->as_whichsyncplan; } - if (pstate->pa_next_plan == node->as_whichplan) + if (pstate->pa_next_plan == node->as_whichsyncplan) { /* We've tried everything! */ pstate->pa_next_plan = INVALID_SUBPLAN_INDEX; @@ -502,7 +681,7 @@ } /* Pick the plan we found, and advance pa_next_plan one more time. */ - node->as_whichplan = pstate->pa_next_plan++; + node->as_whichsyncplan = pstate->pa_next_plan++; if (pstate->pa_next_plan >= node->as_nplans) { if (append->first_partial_plan < node->as_nplans) @@ -518,8 +697,8 @@ } /* If non-partial, immediately mark as finished. */ - if (node->as_whichplan < append->first_partial_plan) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan < append->first_partial_plan) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; LWLockRelease(&pstate->pa_lock); diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c index 59865f5..9cb5470 100644 --- a/src/backend/executor/nodeForeignscan.c +++ b/src/backend/executor/nodeForeignscan.c @@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate) (ExecScanRecheckMtd) ForeignRecheck); } - /* ---------------------------------------------------------------- * ExecInitForeignScan * ---------------------------------------------------------------- */ @@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags) scanstate->ss.ps.plan = (Plan *) node; scanstate->ss.ps.state = estate; scanstate->ss.ps.ExecProcNode = ExecForeignScan; + scanstate->ss.ps.asyncstate = AS_AVAILABLE; + + if ((eflags & EXEC_FLAG_ASYNC) != 0) + scanstate->fs_async = true; /* * Miscellaneous initialization @@ -389,3 +392,20 @@ ExecShutdownForeignScan(ForeignScanState *node) if (fdwroutine->ShutdownForeignScan) fdwroutine->ShutdownForeignScan(node); } + +/* ---------------------------------------------------------------- + * ExecForeignAsyncConfigureWait + * + * In async mode, configure for a wait + * ---------------------------------------------------------------- + */ +bool +ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes, + void *caller_data, bool reinit) +{ + FdwRoutine *fdwroutine = node->fdwroutine; + + Assert(fdwroutine->ForeignAsyncConfigureWait != NULL); + return fdwroutine->ForeignAsyncConfigureWait(node, wes, + caller_data, reinit); +}
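To make the callback contract above concrete: ExecForeignAsyncConfigureWait() only dispatches, so the per-FDW callback is what actually registers a socket with the wait event set. A minimal sketch of such a callback, assuming a hypothetical MyFdwScanState whose "sock" field holds the remote connection's socket (postgres_fdw's real implementation appears in the third patch below):

static bool
myfdwForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
							   void *caller_data, bool reinit)
{
	/* hypothetical scan state; postgres_fdw uses PgFdwScanState instead */
	MyFdwScanState *fstate = (MyFdwScanState *) node->fdw_state;

	/* Without reinit, the event set is reused and our event is still in it. */
	if (!reinit)
		return true;

	/* Ask to be woken when the remote socket becomes readable. */
	AddWaitEventToSet(wes, WL_SOCKET_READABLE, fstate->sock,
					  NULL, caller_data);
	return true;
}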
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index e599283..d85cb9c 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -204,7 +204,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual, Index scanrelid, int wtParam); static Append *make_append(List *appendplans, int first_partial_plan, - List *tlist, List *partitioned_rels); + int nasyncplans, int referent, + List *tlist, List *partitioned_rels); static RecursiveUnion *make_recursive_union(List *tlist, Plan *lefttree, Plan *righttree, @@ -284,6 +285,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root, List *rowMarks, OnConflictExpr *onconflict, int epqParam); static GatherMerge *create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path); +static bool is_async_capable_path(Path *path); /* @@ -1014,8 +1016,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path) { Append *plan; List *tlist = build_path_tlist(root, &best_path->path); - List *subplans = NIL; + List *asyncplans = NIL; + List *syncplans = NIL; ListCell *subpaths; + int nasyncplans = 0; + bool first = true; + bool referent_is_sync = true; /* * The subpaths list could be empty, if every child was proven empty by @@ -1050,7 +1056,21 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path) /* Must insist that all children return the same tlist */ subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST); - subplans = lappend(subplans, subplan); + /* + * Classify as async-capable or not. If we have decided to run the + * children in parallel, we cannot run any of them asynchronously. + */ + if (!best_path->path.parallel_safe && is_async_capable_path(subpath)) + { + asyncplans = lappend(asyncplans, subplan); + ++nasyncplans; + if (first) + referent_is_sync = false; + } + else + syncplans = lappend(syncplans, subplan); + + first = false; } /* @@ -1060,8 +1080,10 @@ * parent-rel Vars it'll be asked to emit. */ - plan = make_append(subplans, best_path->first_partial_path, - tlist, best_path->partitioned_rels); + plan = make_append(list_concat(asyncplans, syncplans), + best_path->first_partial_path, nasyncplans, + referent_is_sync ? nasyncplans : 0, tlist, + best_path->partitioned_rels); copy_generic_path_info(&plan->plan, (Path *) best_path); @@ -5307,8 +5329,8 @@ make_foreignscan(List *qptlist, } static Append * -make_append(List *appendplans, int first_partial_plan, - List *tlist, List *partitioned_rels) +make_append(List *appendplans, int first_partial_plan, int nasyncplans, + int referent, List *tlist, List *partitioned_rels) { Append *node = makeNode(Append); Plan *plan = &node->plan; @@ -5320,6 +5342,8 @@ make_append(List *appendplans, int first_partial_plan, node->partitioned_rels = partitioned_rels; node->appendplans = appendplans; node->first_partial_plan = first_partial_plan; + node->nasyncplans = nasyncplans; + node->referent = referent; return node; } @@ -6656,3 +6680,27 @@ is_projection_capable_plan(Plan *plan) } return true; } + +/* + * is_async_capable_path + * Check whether a given Path node is async-capable. 
+ */ +static bool +is_async_capable_path(Path *path) +{ + switch (nodeTag(path)) + { + case T_ForeignPath: + { + FdwRoutine *fdwroutine = path->parent->fdwroutine; + + Assert(fdwroutine != NULL); + if (fdwroutine->IsForeignPathAsyncCapable != NULL && + fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path)) + return true; + } + default: + break; + } + return false; +} diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index d130114..667878b 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -3673,6 +3673,9 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_SYNC_REP: event_name = "SyncRep"; break; + case WAIT_EVENT_ASYNC_WAIT: + event_name = "AsyncExecWait"; + break; /* no default case, so that compiler will warn */ } diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h new file mode 100644 index 0000000..5fd67d9 --- /dev/null +++ b/src/include/executor/execAsync.h @@ -0,0 +1,23 @@ +/*-------------------------------------------------------------------- + * execAsync.h + * Support functions for asynchronous query execution + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/include/executor/execAsync.h +-------------------------------------------------------------------- */ +#ifndef EXECASYNC_H +#define EXECASYNC_H + +#include "nodes/execnodes.h" +#include "storage/latch.h" + +extern void ExecAsyncSetState(PlanState *pstate, AsyncState status); +extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node, + void *data, bool reinit); +extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, + long timeout); +#endif /* EXECASYNC_H */ diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index 6545a80..60f4e51 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -63,6 +63,7 @@ #define EXEC_FLAG_WITH_OIDS 0x0020 /* force OIDs in returned tuples */ #define EXEC_FLAG_WITHOUT_OIDS 0x0040 /* force no OIDs in returned tuples */ #define EXEC_FLAG_WITH_NO_DATA 0x0080 /* rel scannability doesn't matter */ +#define EXEC_FLAG_ASYNC 0x0100 /* request async execution */ /* Hook for plugins to get control in ExecutorStart() */ diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h index ccb66be..67abf8e 100644 --- a/src/include/executor/nodeForeignscan.h +++ b/src/include/executor/nodeForeignscan.h @@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node, extern void ExecForeignScanInitializeWorker(ForeignScanState *node, ParallelWorkerContext *pwcxt); extern void ExecShutdownForeignScan(ForeignScanState *node); +extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, bool reinit); #endif /* NODEFOREIGNSCAN_H */ diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h index e88fee3..beb3f0d 100644 --- a/src/include/foreign/fdwapi.h +++ b/src/include/foreign/fdwapi.h @@ -161,6 +161,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root, typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root, List *fdw_private, RelOptInfo *child_rel); +typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path); +typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node, + WaitEventSet *wes, + void 
*caller_data, + bool reinit); /* * FdwRoutine is the struct returned by a foreign-data wrapper's handler @@ -182,6 +187,7 @@ typedef struct FdwRoutine GetForeignPlan_function GetForeignPlan; BeginForeignScan_function BeginForeignScan; IterateForeignScan_function IterateForeignScan; + IterateForeignScan_function IterateForeignScanAsync; ReScanForeignScan_function ReScanForeignScan; EndForeignScan_function EndForeignScan; @@ -232,6 +238,11 @@ typedef struct FdwRoutine InitializeDSMForeignScan_function InitializeDSMForeignScan; ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan; InitializeWorkerForeignScan_function InitializeWorkerForeignScan; + + /* Support functions for asynchronous execution */ + IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable; + ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait; + ShutdownForeignScan_function ShutdownForeignScan; /* Support functions for path reparameterization. */ diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 4bb5cb1..405ad7b 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -851,6 +851,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate); * abstract superclass for all PlanState-type nodes. * ---------------- */ +typedef enum AsyncState +{ + AS_AVAILABLE, + AS_WAITING +} AsyncState; + typedef struct PlanState { NodeTag type; @@ -891,6 +897,9 @@ typedef struct PlanState TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */ ExprContext *ps_ExprContext; /* node's expression-evaluation context */ ProjectionInfo *ps_ProjInfo; /* info for doing tuple projection */ + + AsyncState asyncstate; + int32 padding; /* to keep alignment of derived types */ } PlanState; /* ---------------- @@ -1013,10 +1022,16 @@ struct AppendState PlanState ps; /* its first field is NodeTag */ PlanState **appendplans; /* array of PlanStates for my inputs */ int as_nplans; - int as_whichplan; + int as_nasyncplans; /* # of async-capable children */ ParallelAppendState *as_pstate; /* parallel coordination info */ + int as_whichsyncplan; /* which sync plan is being executed */ Size pstate_len; /* size of parallel coordination info */ bool (*choose_next_subplan) (AppendState *); + bool as_syncdone; /* all synchronous plans done? 
*/ + Bitmapset *as_needrequest; /* async plans needing a new request */ + Bitmapset *as_pending_async; /* pending async plans */ + TupleTableSlot **as_asyncresult; /* unreturned results of async plans */ + int as_nasyncresult; /* # of valid entries in as_asyncresult */ }; /* ---------------- @@ -1567,6 +1582,7 @@ typedef struct ForeignScanState Size pscan_len; /* size of parallel coordination information */ /* use struct pointer to avoid including fdwapi.h here */ struct FdwRoutine *fdwroutine; + bool fs_async; void *fdw_state; /* foreign-data wrapper can keep state here */ } ForeignScanState; diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h index 74e9fb5..b4535f0 100644 --- a/src/include/nodes/plannodes.h +++ b/src/include/nodes/plannodes.h @@ -249,6 +249,8 @@ typedef struct Append List *partitioned_rels; List *appendplans; int first_partial_plan; + int nasyncplans; /* # of async plans, always at start of list */ + int referent; /* index of inheritance tree referent */ } Append; /* ---------------- diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 3d3c0b6..a1ba26f 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -831,7 +831,8 @@ typedef enum WAIT_EVENT_REPLICATION_ORIGIN_DROP, WAIT_EVENT_REPLICATION_SLOT_DROP, WAIT_EVENT_SAFE_SNAPSHOT, - WAIT_EVENT_SYNC_REP + WAIT_EVENT_SYNC_REP, + WAIT_EVENT_ASYNC_WAIT } WaitEventIPC; /* ---------- -- 2.9.2 From 6612fbe0cab492fedead1d35f1b9cdf24f3e6dd4 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 19 Oct 2017 17:24:07 +0900 Subject: [PATCH 3/3] async postgres_fdw --- contrib/postgres_fdw/connection.c | 26 ++ contrib/postgres_fdw/expected/postgres_fdw.out | 128 ++++--- contrib/postgres_fdw/postgres_fdw.c | 484 +++++++++++++++++++++---- contrib/postgres_fdw/postgres_fdw.h | 2 + contrib/postgres_fdw/sql/postgres_fdw.sql | 20 +- 5 files changed, 522 insertions(+), 138 deletions(-) diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index 00c926b..4f3d59d 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -58,6 +58,7 @@ typedef struct ConnCacheEntry bool invalidated; /* true if reconnect is pending */ uint32 server_hashvalue; /* hash value of foreign server OID */ uint32 mapping_hashvalue; /* hash value of user mapping OID */ + void *storage; /* connection specific storage */ } ConnCacheEntry; /* @@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt) elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)", entry->conn, server->servername, user->umid, user->userid); + entry->storage = NULL; } /* @@ -216,6 +218,30 @@ } /* + * Returns the connection-specific storage for this user. Allocate it with + * initsize if it does not exist yet. + */ +void * +GetConnectionSpecificStorage(UserMapping *user, size_t initsize) +{ + bool found; + ConnCacheEntry *entry; + ConnCacheKey key; + + key = user->umid; + entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found); + Assert(found); + + if (entry->storage == NULL) + { + entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize); + memset(entry->storage, 0, initsize); + } + + return entry->storage; +} + +/* * Connect to remote server using specified server and user mapping properties. 
*/ static PGconn * diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 683d641..3b4eefa 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -6514,7 +6514,7 @@ INSERT INTO a(aa) VALUES('aaaaa'); INSERT INTO b(aa) VALUES('bbb'); INSERT INTO b(aa) VALUES('bbbb'); INSERT INTO b(aa) VALUES('bbbbb'); -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+------- a | aaa @@ -6542,7 +6542,7 @@ SELECT tableoid::regclass, * FROM ONLY a; (3 rows) UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | aaa @@ -6570,7 +6570,7 @@ SELECT tableoid::regclass, * FROM ONLY a; (3 rows) UPDATE b SET aa = 'new'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | aaa @@ -6598,7 +6598,7 @@ SELECT tableoid::regclass, * FROM ONLY a; (3 rows) UPDATE a SET aa = 'newtoo'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | newtoo @@ -6664,35 +6664,40 @@ insert into bar2 values(3,33,33); insert into bar2 values(4,44,44); insert into bar2 values(7,77,77); explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for update; - QUERY PLAN ----------------------------------------------------------------------------------------------- +select * from bar where f1 in (select f1 from foo) order by 1 for update; + QUERY PLAN +----------------------------------------------------------------------------------------------------------------- LockRows Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid - -> Hash Join + -> Merge Join Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid Inner Unique: true - Hash Cond: (bar.f1 = foo.f1) - -> Append - -> Seq Scan on public.bar + Merge Cond: (bar.f1 = foo.f1) + -> Merge Append + Sort Key: bar.f1 + -> Sort Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid + Sort Key: bar.f1 + -> Seq Scan on public.bar + Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid -> Foreign Scan on public.bar2 Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid - Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE - -> Hash + Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE + -> Sort Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Sort Key: foo.f1 -> HashAggregate Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -(23 rows) + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 +(28 rows) -select * from bar where f1 in (select f1 from foo) for update; +select * from bar where f1 in (select f1 from foo) order by 1 for update; f1 | f2 ----+---- 1 | 11 @@ -6702,35 +6707,40 @@ select * from bar where f1 in (select f1 from foo) for update; (4 rows) explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for share; - QUERY PLAN ----------------------------------------------------------------------------------------------- +select * from 
bar where f1 in (select f1 from foo) order by 1 for share; + QUERY PLAN +---------------------------------------------------------------------------------------------------------------- LockRows Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid - -> Hash Join + -> Merge Join Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid Inner Unique: true - Hash Cond: (bar.f1 = foo.f1) - -> Append - -> Seq Scan on public.bar + Merge Cond: (bar.f1 = foo.f1) + -> Merge Append + Sort Key: bar.f1 + -> Sort Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid + Sort Key: bar.f1 + -> Seq Scan on public.bar + Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid -> Foreign Scan on public.bar2 Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid - Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE - -> Hash + Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE + -> Sort Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Sort Key: foo.f1 -> HashAggregate Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -(23 rows) + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 +(28 rows) -select * from bar where f1 in (select f1 from foo) for share; +select * from bar where f1 in (select f1 from foo) order by 1 for share; f1 | f2 ----+---- 1 | 11 @@ -6760,11 +6770,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Hash Join Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid Inner Unique: true @@ -6778,11 +6788,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 (39 rows) update bar set f2 = f2 + 100 where f1 in (select f1 from foo); @@ -6813,16 +6823,16 @@ where bar.f1 = ss.f1; Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1)) Hash Cond: (foo.f1 = bar.f1) -> Append - -> Seq Scan on public.foo - Output: ROW(foo.f1), foo.f1 -> Foreign Scan on public.foo2 Output: ROW(foo2.f1), foo2.f1 Remote SQL: SELECT f1 FROM public.loct1 - -> Seq Scan on public.foo foo_1 - Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) -> Foreign Scan on public.foo2 foo2_1 Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) Remote SQL: SELECT f1 FROM public.loct1 + -> Seq Scan on public.foo + Output: ROW(foo.f1), foo.f1 + -> Seq Scan on public.foo foo_1 + Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) -> Hash Output: bar.f1, bar.f2, bar.ctid -> Seq Scan on public.bar @@ -6840,16 +6850,16 @@ where bar.f1 = ss.f1; Output: (ROW(foo.f1)), foo.f1 Sort Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: 
ROW(foo.f1), foo.f1 -> Foreign Scan on public.foo2 Output: ROW(foo2.f1), foo2.f1 Remote SQL: SELECT f1 FROM public.loct1 - -> Seq Scan on public.foo foo_1 - Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) -> Foreign Scan on public.foo2 foo2_1 Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) Remote SQL: SELECT f1 FROM public.loct1 + -> Seq Scan on public.foo + Output: ROW(foo.f1), foo.f1 + -> Seq Scan on public.foo foo_1 + Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) (45 rows) update bar set f2 = f2 + 100 @@ -7000,27 +7010,33 @@ delete from foo where f1 < 5 returning *; (5 rows) explain (verbose, costs off) -update bar set f2 = f2 + 100 returning *; - QUERY PLAN ------------------------------------------------------------------------------- - Update on public.bar - Output: bar.f1, bar.f2 - Update on public.bar - Foreign Update on public.bar2 - -> Seq Scan on public.bar - Output: bar.f1, (bar.f2 + 100), bar.ctid - -> Foreign Update on public.bar2 - Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2 -(8 rows) +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; + QUERY PLAN +-------------------------------------------------------------------------------------- + Sort + Output: u.f1, u.f2 + Sort Key: u.f1 + CTE u + -> Update on public.bar + Output: bar.f1, bar.f2 + Update on public.bar + Foreign Update on public.bar2 + -> Seq Scan on public.bar + Output: bar.f1, (bar.f2 + 100), bar.ctid + -> Foreign Update on public.bar2 + Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2 + -> CTE Scan on u + Output: u.f1, u.f2 +(14 rows) -update bar set f2 = f2 + 100 returning *; +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; f1 | f2 ----+----- 1 | 311 2 | 322 - 6 | 266 3 | 333 4 | 344 + 6 | 266 7 | 277 (6 rows) diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 7992ba5..5ea1d88 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -20,6 +20,8 @@ #include "commands/defrem.h" #include "commands/explain.h" #include "commands/vacuum.h" +#include "executor/execAsync.h" +#include "executor/nodeForeignscan.h" #include "foreign/fdwapi.h" #include "funcapi.h" #include "miscadmin.h" @@ -34,6 +36,7 @@ #include "optimizer/var.h" #include "optimizer/tlist.h" #include "parser/parsetree.h" +#include "pgstat.h" #include "utils/builtins.h" #include "utils/guc.h" #include "utils/lsyscache.h" @@ -53,6 +56,9 @@ PG_MODULE_MAGIC; /* If no remote estimates, assume a sort costs 20% extra */ #define DEFAULT_FDW_SORT_MULTIPLIER 1.2 +/* Retrieve the PgFdwScanState struct from a ForeignScanState */ +#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state) + /* * Indexes of FDW-private information stored in fdw_private lists. * @@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex }; /* + * Connection private area structure. + */ +typedef struct PgFdwConnpriv +{ + ForeignScanState *current_owner; /* The node currently running a query + * on this connection */ +} PgFdwConnpriv; + +/* Execution state base type */ +typedef struct PgFdwState +{ + PGconn *conn; /* connection for the scan */ + PgFdwConnpriv *connpriv; /* connection private memory */ +} PgFdwState; + +/* * Execution state of a foreign scan using postgres_fdw. */ typedef struct PgFdwScanState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table. NULL * for a foreign join scan. 
*/ TupleDesc tupdesc; /* tuple descriptor of scan */ @@ -134,7 +157,7 @@ typedef struct PgFdwScanState List *retrieved_attrs; /* list of retrieved attribute numbers */ /* for remote query execution */ - PGconn *conn; /* connection for the scan */ + bool result_ready; unsigned int cursor_number; /* quasi-unique ID for my cursor */ bool cursor_exists; /* have we created the cursor? */ int numParams; /* number of parameters passed to query */ @@ -150,6 +173,13 @@ typedef struct PgFdwScanState /* batch-level state, for optimizing rewinds and avoiding useless fetch */ int fetch_ct_2; /* Min(# of fetches done, 2) */ bool eof_reached; /* true if last fetch reached EOF */ + bool run_async; /* true if run asynchronously */ + bool async_waiting; /* true if requesting the parent to wait */ + ForeignScanState *waiter; /* Next node to run a query among nodes + * sharing the same connection */ + ForeignScanState *last_waiter; /* A waiting node at the end of a waiting + * list. Maintained only by the current + * owner of the connection */ /* working memory contexts */ MemoryContext batch_cxt; /* context holding current batch of tuples */ @@ -163,11 +193,11 @@ typedef struct PgFdwScanState */ typedef struct PgFdwModifyState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table */ AttInMetadata *attinmeta; /* attribute datatype conversion metadata */ /* for remote query execution */ - PGconn *conn; /* connection for the scan */ char *p_name; /* name of prepared statement, if created */ /* extracted fdw_private data */ @@ -190,6 +220,7 @@ typedef struct PgFdwModifyState */ typedef struct PgFdwDirectModifyState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table */ AttInMetadata *attinmeta; /* attribute datatype conversion metadata */ @@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags); static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node); static void postgresReScanForeignScan(ForeignScanState *node); static void postgresEndForeignScan(ForeignScanState *node); +static void postgresShutdownForeignScan(ForeignScanState *node); static void postgresAddForeignUpdateTargets(Query *parsetree, RangeTblEntry *target_rte, Relation target_relation); @@ -348,6 +380,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root, UpperRelationKind stage, RelOptInfo *input_rel, RelOptInfo *output_rel); +static bool postgresIsForeignPathAsyncCapable(ForeignPath *path); +static bool postgresForeignAsyncConfigureWait(ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, bool reinit); /* * Helper functions @@ -368,7 +404,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel, EquivalenceClass *ec, EquivalenceMember *em, void *arg); static void create_cursor(ForeignScanState *node); -static void fetch_more_data(ForeignScanState *node); +static void request_more_data(ForeignScanState *node); +static void fetch_received_data(ForeignScanState *node); +static void vacate_connection(PgFdwState *fdwconn); +static void absorb_current_result(ForeignScanState *node); static void close_cursor(PGconn *conn, unsigned int cursor_number); static void prepare_foreign_modify(PgFdwModifyState *fmstate); static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate, @@ -438,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS) routine->IterateForeignScan = postgresIterateForeignScan; routine->ReScanForeignScan = postgresReScanForeignScan; 
routine->EndForeignScan = postgresEndForeignScan; + routine->ShutdownForeignScan = postgresShutdownForeignScan; /* Functions for updating foreign tables */ routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets; @@ -472,6 +512,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS) /* Support functions for upper relation push-down */ routine->GetForeignUpperPaths = postgresGetForeignUpperPaths; + /* Support functions for async execution */ + routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable; + routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait; + PG_RETURN_POINTER(routine); } @@ -1322,12 +1366,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags) * Get connection to the foreign server. Connection manager will * establish new connection if necessary. */ - fsstate->conn = GetConnection(user, false); + fsstate->s.conn = GetConnection(user, false); + fsstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); + fsstate->s.connpriv->current_owner = NULL; + fsstate->waiter = NULL; + fsstate->last_waiter = node; /* Assign a unique ID for my cursor */ - fsstate->cursor_number = GetCursorNumber(fsstate->conn); + fsstate->cursor_number = GetCursorNumber(fsstate->s.conn); fsstate->cursor_exists = false; + /* Initialize async execution status */ + fsstate->run_async = false; + fsstate->async_waiting = false; + /* Get private info created by planner functions. */ fsstate->query = strVal(list_nth(fsplan->fdw_private, FdwScanPrivateSelectSql)); @@ -1383,32 +1436,136 @@ static TupleTableSlot * postgresIterateForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); TupleTableSlot *slot = node->ss.ss_ScanTupleSlot; /* - * If this is the first call after Begin or ReScan, we need to create the - * cursor on the remote side. - */ - if (!fsstate->cursor_exists) - create_cursor(node); - - /* * Get some more tuples, if we've run out. */ if (fsstate->next_tuple >= fsstate->num_tuples) { - /* No point in another fetch if we already detected EOF, though. */ - if (!fsstate->eof_reached) - fetch_more_data(node); - /* If we didn't get any tuples, must be end of data. */ + ForeignScanState *next_conn_owner = node; + + /* This node has sent a query on this connection */ + if (fsstate->s.connpriv->current_owner == node) + { + /* Check if the result is available */ + if (PQisBusy(fsstate->s.conn)) + { + int rc = WaitLatchOrSocket(NULL, + WL_SOCKET_READABLE | WL_TIMEOUT, + PQsocket(fsstate->s.conn), 0, + WAIT_EVENT_ASYNC_WAIT); + if (node->fs_async && !(rc & WL_SOCKET_READABLE)) + { + /* + * This node is not ready yet. Tell the caller to wait. + */ + fsstate->result_ready = false; + node->ss.ps.asyncstate = AS_WAITING; + return ExecClearTuple(slot); + } + } + + Assert(fsstate->async_waiting); + fsstate->async_waiting = false; + fetch_received_data(node); + + /* + * If another node is waiting on this connection, let the + * first waiter be the next owner of this connection. 
+ */ + if (fsstate->waiter) + { + PgFdwScanState *next_owner_state; + + next_conn_owner = fsstate->waiter; + next_owner_state = GetPgFdwScanState(next_conn_owner); + fsstate->waiter = NULL; + + /* + * only the current owner is responsible for maintaining the + * shortcut to the last waiter + */ + next_owner_state->last_waiter = fsstate->last_waiter; + + /* + * for simplicity, last_waiter points to the node itself when no + * one is waiting for it. + */ + fsstate->last_waiter = node; + } + } + else if (fsstate->s.connpriv->current_owner && + !GetPgFdwScanState(node)->eof_reached) + { + /* + * Someone else is holding this connection and we want this node to + * run later. Add myself to the tail of the waiters' list then + * return not-ready. To avoid scanning through the waiters' list, + * the current owner is to maintain the shortcut to the last + * waiter. + */ + PgFdwScanState *conn_owner_state = + GetPgFdwScanState(fsstate->s.connpriv->current_owner); + ForeignScanState *last_waiter = conn_owner_state->last_waiter; + PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter); + + last_waiter_state->waiter = node; + conn_owner_state->last_waiter = node; + + /* Register the node in the async-waiting node list */ + Assert(!GetPgFdwScanState(node)->async_waiting); + + GetPgFdwScanState(node)->async_waiting = true; + + fsstate->result_ready = fsstate->eof_reached; + node->ss.ps.asyncstate = + fsstate->result_ready ? AS_AVAILABLE : AS_WAITING; + return ExecClearTuple(slot); + } + + /* At this time no node is running on the connection */ + Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner + == NULL); + /* + * Send the next request for the next owner of this connection if + * needed. + */ + if (!GetPgFdwScanState(next_conn_owner)->eof_reached) + { + PgFdwScanState *next_owner_state = + GetPgFdwScanState(next_conn_owner); + + request_more_data(next_conn_owner); + + /* Register the node in the async-waiting node list */ + if (!next_owner_state->async_waiting) + next_owner_state->async_waiting = true; + + if (!next_conn_owner->fs_async) + fetch_received_data(next_conn_owner); + } + + + /* + * If we haven't received a result for the given node this time, + * return with no tuple to give way to other nodes. + */ if (fsstate->next_tuple >= fsstate->num_tuples) + { + fsstate->result_ready = fsstate->eof_reached; + node->ss.ps.asyncstate = + fsstate->result_ready ? AS_AVAILABLE : AS_WAITING; return ExecClearTuple(slot); + } } /* * Return the next tuple. */ + fsstate->result_ready = true; + node->ss.ps.asyncstate = AS_AVAILABLE; ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++], slot, InvalidBuffer, @@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node) static void postgresReScanForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); char sql[64]; PGresult *res; @@ -1432,6 +1589,9 @@ if (!fsstate->cursor_exists) return; + /* Absorb the remaining result */ + absorb_current_result(node); + /* * If any internal parameters affecting this node have changed, we'd * better destroy and recreate the cursor. Otherwise, rewinding it should @@ -1460,9 +1620,9 @@ * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. 
*/ - res = pgfdw_exec_query(fsstate->conn, sql); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fsstate->conn, true, sql); + res = pgfdw_exec_query(fsstate->s.conn, sql); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fsstate->conn, true, sql); + pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql); PQclear(res); /* Now force a fresh FETCH. */ @@ -1480,7 +1640,7 @@ static void postgresEndForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); /* if fsstate is NULL, we are in EXPLAIN; nothing to do */ if (fsstate == NULL) @@ -1488,16 +1648,32 @@ /* Close the cursor if open, to prevent accumulation of cursors */ if (fsstate->cursor_exists) - close_cursor(fsstate->conn, fsstate->cursor_number); + close_cursor(fsstate->s.conn, fsstate->cursor_number); /* Release remote connection */ - ReleaseConnection(fsstate->conn); - fsstate->conn = NULL; + ReleaseConnection(fsstate->s.conn); + fsstate->s.conn = NULL; /* MemoryContexts will be deleted automatically. */ } /* + * postgresShutdownForeignScan + * Remove asynchrony stuff and clean up garbage on the connection. + */ +static void +postgresShutdownForeignScan(ForeignScanState *node) +{ + ForeignScan *plan = (ForeignScan *) node->ss.ps.plan; + + if (plan->operation != CMD_SELECT) + return; + + /* Absorb the remaining result */ + absorb_current_result(node); +} + +/* * postgresAddForeignUpdateTargets * Add resjunk column(s) needed for update/delete on a foreign table */ @@ -1700,7 +1876,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate, user = GetUserMapping(userid, table->serverid); /* Open connection; report that we'll create a prepared statement. */ - fmstate->conn = GetConnection(user, true); + fmstate->s.conn = GetConnection(user, true); + fmstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); fmstate->p_name = NULL; /* prepared statement not made yet */ /* Deconstruct fdw_private data. */ @@ -1779,6 +1957,8 @@ postgresExecForeignInsert(EState *estate, PGresult *res; int n_rows; + vacate_connection((PgFdwState *)fmstate); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1789,14 +1969,14 @@ /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -1804,10 +1984,10 @@ * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ? 
PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -1845,6 +2025,8 @@ postgresExecForeignUpdate(EState *estate, PGresult *res; int n_rows; + vacate_connection((PgFdwState *)fmstate); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1865,14 +2047,14 @@ postgresExecForeignUpdate(EState *estate, /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -1880,10 +2062,10 @@ postgresExecForeignUpdate(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -1921,6 +2103,8 @@ postgresExecForeignDelete(EState *estate, PGresult *res; int n_rows; + vacate_connection((PgFdwState *)fmstate); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1941,14 +2125,14 @@ postgresExecForeignDelete(EState *estate, /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -1956,10 +2140,10 @@ postgresExecForeignDelete(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -2006,16 +2190,16 @@ postgresEndForeignModify(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. 
*/ - res = pgfdw_exec_query(fmstate->conn, sql); + res = pgfdw_exec_query(fmstate->s.conn, sql); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fmstate->conn, true, sql); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql); PQclear(res); fmstate->p_name = NULL; } /* Release remote connection */ - ReleaseConnection(fmstate->conn); - fmstate->conn = NULL; + ReleaseConnection(fmstate->s.conn); + fmstate->s.conn = NULL; } /* @@ -2303,7 +2487,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags) * Get connection to the foreign server. Connection manager will * establish new connection if necessary. */ - dmstate->conn = GetConnection(user, false); + dmstate->s.conn = GetConnection(user, false); + dmstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); /* Initialize state variable */ dmstate->num_tuples = -1; /* -1 means not set yet */ @@ -2356,7 +2542,10 @@ postgresIterateDirectModify(ForeignScanState *node) * If this is the first call after Begin, execute the statement. */ if (dmstate->num_tuples == -1) + { + vacate_connection((PgFdwState *)dmstate); execute_dml_stmt(node); + } /* * If the local query doesn't specify RETURNING, just clear tuple slot. @@ -2403,8 +2592,8 @@ postgresEndDirectModify(ForeignScanState *node) PQclear(dmstate->result); /* Release remote connection */ - ReleaseConnection(dmstate->conn); - dmstate->conn = NULL; + ReleaseConnection(dmstate->s.conn); + dmstate->s.conn = NULL; /* MemoryContext will be deleted automatically. */ } @@ -2523,6 +2712,7 @@ estimate_path_cost_size(PlannerInfo *root, List *local_param_join_conds; StringInfoData sql; PGconn *conn; + PgFdwConnpriv *connpriv; Selectivity local_sel; QualCost local_cost; List *fdw_scan_tlist = NIL; @@ -2565,6 +2755,16 @@ estimate_path_cost_size(PlannerInfo *root, /* Get the remote estimate */ conn = GetConnection(fpinfo->user, false); + connpriv = GetConnectionSpecificStorage(fpinfo->user, + sizeof(PgFdwConnpriv)); + if (connpriv) + { + PgFdwState tmpstate; + tmpstate.conn = conn; + tmpstate.connpriv = connpriv; + vacate_connection(&tmpstate); + } + get_remote_estimate(sql.data, conn, &rows, &width, &startup_cost, &total_cost); ReleaseConnection(conn); @@ -2919,11 +3119,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel, static void create_cursor(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); ExprContext *econtext = node->ss.ps.ps_ExprContext; int numParams = fsstate->numParams; const char **values = fsstate->param_values; - PGconn *conn = fsstate->conn; + PGconn *conn = fsstate->s.conn; StringInfoData buf; PGresult *res; @@ -2989,47 +3189,96 @@ create_cursor(ForeignScanState *node) * Fetch some more rows from the node's cursor. */ static void -fetch_more_data(ForeignScanState *node) +request_more_data(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); + PGconn *conn = fsstate->s.conn; + char sql[64]; + + /* The connection should be vacant */ + Assert(fsstate->s.connpriv->current_owner == NULL); + + /* + * If this is the first call after Begin or ReScan, we need to create the + * cursor on the remote side. 
+ */ + if (!fsstate->cursor_exists) + create_cursor(node); + + snprintf(sql, sizeof(sql), "FETCH %d FROM c%u", + fsstate->fetch_size, fsstate->cursor_number); + + if (!PQsendQuery(conn, sql)) + pgfdw_report_error(ERROR, NULL, conn, false, sql); + + fsstate->s.connpriv->current_owner = node; +} + +/* + * Collect the rows returned for the fetch request issued earlier by + * request_more_data. + */ +static void +fetch_received_data(ForeignScanState *node) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); PGresult *volatile res = NULL; MemoryContext oldcontext; + /* I should be the current connection owner */ + Assert(fsstate->s.connpriv->current_owner == node); + /* * We'll store the tuples in the batch_cxt. First, flush the previous - * batch. + * batch if no tuples remain */ - fsstate->tuples = NULL; - MemoryContextReset(fsstate->batch_cxt); + if (fsstate->next_tuple >= fsstate->num_tuples) + { + fsstate->tuples = NULL; + fsstate->num_tuples = 0; + MemoryContextReset(fsstate->batch_cxt); + } + else if (fsstate->next_tuple > 0) + { + /* move the remaining tuples to the beginning of the store */ + int n = 0; + + while(fsstate->next_tuple < fsstate->num_tuples) + fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++]; + fsstate->num_tuples = n; + } + oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt); /* PGresult must be released before leaving this function. */ PG_TRY(); { - PGconn *conn = fsstate->conn; + PGconn *conn = fsstate->s.conn; char sql[64]; - int numrows; + int addrows; + size_t newsize; int i; snprintf(sql, sizeof(sql), "FETCH %d FROM c%u", fsstate->fetch_size, fsstate->cursor_number); - res = pgfdw_exec_query(conn, sql); + res = pgfdw_get_result(conn, sql); /* On error, report the original query, not the FETCH. */ if (PQresultStatus(res) != PGRES_TUPLES_OK) pgfdw_report_error(ERROR, res, conn, false, fsstate->query); /* Convert the data into HeapTuples */ - numrows = PQntuples(res); - fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple)); - fsstate->num_tuples = numrows; - fsstate->next_tuple = 0; + addrows = PQntuples(res); + newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple); + if (fsstate->tuples) + fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize); + else + fsstate->tuples = (HeapTuple *) palloc(newsize); - for (i = 0; i < numrows; i++) + for (i = 0; i < addrows; i++) { Assert(IsA(node->ss.ps.plan, ForeignScan)); - fsstate->tuples[i] = + fsstate->tuples[fsstate->num_tuples + i] = make_tuple_from_result_row(res, i, fsstate->rel, fsstate->attinmeta, @@ -3039,27 +3288,82 @@ } /* Update fetch_ct_2 */ - if (fsstate->fetch_ct_2 < 2) + if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0) fsstate->fetch_ct_2++; + fsstate->next_tuple = 0; + fsstate->num_tuples += addrows; + /* Must be EOF if we didn't get as many tuples as we asked for. 
*/ - fsstate->eof_reached = (numrows < fsstate->fetch_size); + fsstate->eof_reached = (addrows < fsstate->fetch_size); PQclear(res); res = NULL; } PG_CATCH(); { + fsstate->s.connpriv->current_owner = NULL; if (res) PQclear(res); PG_RE_THROW(); } PG_END_TRY(); + fsstate->s.connpriv->current_owner = NULL; + MemoryContextSwitchTo(oldcontext); } /* + * Vacate a connection so that this node can send the next query + */ +static void +vacate_connection(PgFdwState *fdwstate) +{ + PgFdwConnpriv *connpriv = fdwstate->connpriv; + ForeignScanState *owner; + + if (connpriv == NULL || connpriv->current_owner == NULL) + return; + + /* + * let the current connection owner read the result for the running query + */ + owner = connpriv->current_owner; + fetch_received_data(owner); + + /* Clear the waiting list */ + while (owner) + { + PgFdwScanState *fsstate = GetPgFdwScanState(owner); + + fsstate->last_waiter = NULL; + owner = fsstate->waiter; + fsstate->waiter = NULL; + } +} + +/* + * Absorb the result of the current query. + */ +static void +absorb_current_result(ForeignScanState *node) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); + ForeignScanState *owner = fsstate->s.connpriv->current_owner; + + if (owner) + { + PgFdwScanState *target_state = GetPgFdwScanState(owner); + PGconn *conn = target_state->s.conn; + + while(PQisBusy(conn)) + PQclear(PQgetResult(conn)); + fsstate->s.connpriv->current_owner = NULL; + fsstate->async_waiting = false; + } +} +/* * Force assorted GUC parameters to settings that ensure that we'll output * data values in a form that is unambiguous to the remote server. * @@ -3143,7 +3447,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) /* Construct name we'll use for the prepared statement. */ snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u", - GetPrepStmtNumber(fmstate->conn)); + GetPrepStmtNumber(fmstate->s.conn)); p_name = pstrdup(prep_name); /* @@ -3153,12 +3457,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) * the prepared statements we use in this module are simple enough that * the remote server will make the right choices. */ - if (!PQsendPrepare(fmstate->conn, + if (!PQsendPrepare(fmstate->s.conn, p_name, fmstate->query, 0, NULL)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -3166,9 +3470,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); PQclear(res); /* This action shows that the prepare has been done. */ @@ -3299,9 +3603,9 @@ execute_dml_stmt(ForeignScanState *node) * the desired result. This allows us to avoid assuming that the remote * server has the same OIDs we do for the parameters' types. */ - if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams, + if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams, NULL, values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query); + pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query); /* * Get the result, and check for success. 
@@ -3309,10 +3613,10 @@ execute_dml_stmt(ForeignScanState *node) * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query); + dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query); if (PQresultStatus(dmstate->result) != (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true, + pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true, dmstate->query); /* Get the number of rows affected. */ @@ -4582,6 +4886,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root, /* XXX Consider parameterized paths for the join relation */ } +static bool +postgresIsForeignPathAsyncCapable(ForeignPath *path) +{ + return true; +} + + +/* + * Configure a wait event. + * + * Add a wait event only when the node is the connection owner. Otherwise + * another node on this connection is the owner. + */ +static bool +postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes, + void *caller_data, bool reinit) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); + + + /* If the caller didn't reinit, this event is already in event set */ + if (!reinit) + return true; + + if (fsstate->s.connpriv->current_owner == node) + { + AddWaitEventToSet(wes, + WL_SOCKET_READABLE, PQsocket(fsstate->s.conn), + NULL, caller_data); + return true; + } + + return false; +} + + /* * Assess whether the aggregation, grouping and having operations can be pushed * down to the foreign server. As a side effect, save information we obtain in @@ -4946,7 +5286,7 @@ make_tuple_from_result_row(PGresult *res, PgFdwScanState *fdw_sstate; Assert(fsstate); - fdw_sstate = (PgFdwScanState *) fsstate->fdw_state; + fdw_sstate = GetPgFdwScanState(fsstate); tupdesc = fdw_sstate->tupdesc; } diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h index 1ae809d..58ef26e 100644 --- a/contrib/postgres_fdw/postgres_fdw.h +++ b/contrib/postgres_fdw/postgres_fdw.h @@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo UserMapping *user; /* only set in use_remote_estimate mode */ int fetch_size; /* fetch size for this remote table */ + bool allow_prefetch; /* true to allow overlapped fetching */ /* * Name of the relation while EXPLAINing ForeignScan.
It is used for join @@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel); /* in connection.c */ extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt); +void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize); extern void ReleaseConnection(PGconn *conn); extern unsigned int GetCursorNumber(PGconn *conn); extern unsigned int GetPrepStmtNumber(PGconn *conn); diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql index 3c3c5c7..cb9caa5 100644 --- a/contrib/postgres_fdw/sql/postgres_fdw.sql +++ b/contrib/postgres_fdw/sql/postgres_fdw.sql @@ -1535,25 +1535,25 @@ INSERT INTO b(aa) VALUES('bbb'); INSERT INTO b(aa) VALUES('bbbb'); INSERT INTO b(aa) VALUES('bbbbb'); -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; UPDATE b SET aa = 'new'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; UPDATE a SET aa = 'newtoo'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; @@ -1589,12 +1589,12 @@ insert into bar2 values(4,44,44); insert into bar2 values(7,77,77); explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for update; -select * from bar where f1 in (select f1 from foo) for update; +select * from bar where f1 in (select f1 from foo) order by 1 for update; +select * from bar where f1 in (select f1 from foo) order by 1 for update; explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for share; -select * from bar where f1 in (select f1 from foo) for share; +select * from bar where f1 in (select f1 from foo) order by 1 for share; +select * from bar where f1 in (select f1 from foo) order by 1 for share; -- Check UPDATE with inherited target and an inherited source table explain (verbose, costs off) @@ -1653,8 +1653,8 @@ explain (verbose, costs off) delete from foo where f1 < 5 returning *; delete from foo where f1 < 5 returning *; explain (verbose, costs off) -update bar set f2 = f2 + 100 returning *; -update bar set f2 = f2 + 100 returning *; +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; -- Test that UPDATE/DELETE with inherited target works with row-level triggers CREATE TRIGGER trig_row_before -- 2.9.2
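[ For readers skimming the patch above: the postgres_fdw side hinges on libpq's rule that a connection may have only one query in flight. request_more_data() issues PQsendQuery() and records the node in connpriv->current_owner; any other node that needs the same connection must first have the owner drain its result (vacate_connection() calling fetch_received_data()) before sending its own query. Below is a minimal stand-alone libpq sketch of that single-owner discipline; the connection string and queries are placeholders, not taken from the patch. ]

/* One query in flight per PGconn: a would-be sender must first drain
 * whatever the current owner started.  Build with -lpq. */
#include <libpq-fe.h>

/* Absorb all pending results; afterwards the connection is "vacant". */
static void
vacate(PGconn *conn)
{
    PGresult *res;

    while ((res = PQgetResult(conn)) != NULL)   /* NULL => no query running */
        PQclear(res);
}

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=postgres");  /* placeholder */

    if (PQstatus(conn) != CONNECTION_OK)
        return 1;

    /* node A becomes the connection owner by sending asynchronously */
    PQsendQuery(conn, "SELECT pg_sleep(0.1)");

    /* node B wants the connection: vacate it first, then send */
    vacate(conn);
    PQsendQuery(conn, "SELECT 1");
    vacate(conn);

    PQfinish(conn);
    return 0;
}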
At Thu, 11 Jan 2018 17:08:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180111.170839.23674040.horiguchi.kyotaro@lab.ntt.co.jp> > At Mon, 11 Dec 2017 20:07:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20171211.200753.191768178.horiguchi.kyotaro@lab.ntt.co.jp> > > > The attached PoC patch theoretically has no impact on the normal > > > code paths and just brings gain in async cases. > > > > The parallel append patch just committed conflicted with this, and > > the attached is the version rebased to current HEAD. The result of a > > brief performance test follows. > > > > patched(ms) unpatched(ms) gain(%) > > A: simple table scan : 3562.32 3444.81 -3.4 > > B: local partitioning : 1451.25 1604.38 9.5 > > C: single remote table : 8818.92 9297.76 5.1 > > D: sharding (single con) : 5966.14 6646.73 10.2 > > E: sharding (multi con) : 1802.25 6515.49 72.3 > > > > > A and B are degradation checks, which are expected to show no > > > degradation. C is the gain from postgres_fdw's command > > > pre-sending alone, on a single remote table. D is the gain from sharding > > > over a single connection. The number of partitions/shards is 4. E is the gain > > > from using a dedicated connection per shard. > > > > Test A is accelerated by parallel sequential scan. Introducing > > parallel append accelerates test B. Comparing A and B, I doubt > > that the degradation is stably measurable, at least in my environment, but > > I believe there is no degradation theoretically. Tests C to > > E still show a clear gain. > > regards, > > The patch conflicts with 3cac0ec. This is the rebased version. The previous version had not even been workable for some time. - Rebased to current master. (Removed some wrongly-inserted lines) - Fixed a misplaced assertion in postgres_fdw.c (Caused an assertion failure on normal usage) - Properly reset a persistent (static) variable. (Caused a SEGV under certain conditions) - Fixed the EXPLAIN output of an async-mixed append plan. (Chooses the proper subnode as the referent node) regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 6ab58d3fb02f716deaa207824747646dd8c2a448 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 22 May 2017 12:42:58 +0900 Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner A WaitEventSet needs to be released via a resource owner in certain cases. This change adds a resource owner field to WaitEventSet and allows the creator of a WaitEventSet to specify a resource owner.
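[ A hedged usage sketch of the API this patch introduces: CreateWaitEventSet() gains a ResourceOwner argument, so a set created for a transaction-lifetime wait is released by the resource owner if an error is thrown while waiting, instead of leaking as noted upthread. The function below is illustrative only and not part of the patch; "sock" is a placeholder descriptor. ]

/* Sketch only: wait for a socket using a resowner-owned WaitEventSet,
 * the way ExecAsyncEventWait() in patch 2/3 does.  Assumes the usual
 * backend headers: pgstat.h, storage/latch.h, utils/memutils.h,
 * utils/resowner.h. */
static void
wait_socket_readable(pgsocket sock)
{
    WaitEvent     ev;
    WaitEventSet *wes;

    wes = CreateWaitEventSet(TopTransactionContext,
                             TopTransactionResourceOwner,  /* new argument */
                             1);
    AddWaitEventToSet(wes, WL_SOCKET_READABLE, sock, NULL, NULL);

    /* If this errors out, the resource owner frees the set during
     * resource release rather than leaking it. */
    (void) WaitEventSetWait(wes, -1 /* no timeout */, &ev, 1,
                            WAIT_EVENT_ASYNC_WAIT);

    FreeWaitEventSet(wes);
}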
--- src/backend/libpq/pqcomm.c | 2 +- src/backend/storage/ipc/latch.c | 18 ++++++- src/backend/storage/lmgr/condition_variable.c | 2 +- src/backend/utils/resowner/resowner.c | 68 +++++++++++++++++++++++++++ src/include/storage/latch.h | 4 +- src/include/utils/resowner_private.h | 8 ++++ 6 files changed, 97 insertions(+), 5 deletions(-) diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index a4f6d4d..890972b 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -220,7 +220,7 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3); + FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3); AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL); diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index e6706f7..5457899 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -51,6 +51,7 @@ #include "storage/latch.h" #include "storage/pmsignal.h" #include "storage/shmem.h" +#include "utils/resowner_private.h" /* * Select the fd readiness primitive to use. Normally the "most modern" @@ -77,6 +78,8 @@ struct WaitEventSet int nevents; /* number of registered events */ int nevents_space; /* maximum number of events in this set */ + ResourceOwner resowner; /* Resource owner */ + /* * Array, of nevents_space length, storing the definition of events this * set is waiting for. @@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, int ret = 0; int rc; WaitEvent event; - WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3); + WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3); if (wakeEvents & WL_TIMEOUT) Assert(timeout >= 0); @@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch) * WaitEventSetWait(). */ WaitEventSet * -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) { WaitEventSet *set; char *data; Size sz = 0; + if (res) + ResourceOwnerEnlargeWESs(res); + /* * Use MAXALIGN size/alignment to guarantee that later uses of memory are * aligned correctly. E.g. 
epoll_event might need 8 byte alignment on some @@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents) StaticAssertStmt(WSA_INVALID_EVENT == NULL, ""); #endif + /* Register this wait event set if requested */ + set->resowner = res; + if (res) + ResourceOwnerRememberWES(set->resowner, set); + return set; } @@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set) } #endif + if (set->resowner != NULL) + ResourceOwnerForgetWES(set->resowner, set); + pfree(set); } diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c index ef1d5ba..30edc8e 100644 --- a/src/backend/storage/lmgr/condition_variable.c +++ b/src/backend/storage/lmgr/condition_variable.c @@ -69,7 +69,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv) { WaitEventSet *new_event_set; - new_event_set = CreateWaitEventSet(TopMemoryContext, 2); + new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2); AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL); AddWaitEventToSet(new_event_set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c index e09a4f1..7ae8777 100644 --- a/src/backend/utils/resowner/resowner.c +++ b/src/backend/utils/resowner/resowner.c @@ -124,6 +124,7 @@ typedef struct ResourceOwnerData ResourceArray snapshotarr; /* snapshot references */ ResourceArray filearr; /* open temporary files */ ResourceArray dsmarr; /* dynamic shmem segments */ + ResourceArray wesarr; /* wait event sets */ /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */ int nlocks; /* number of owned locks */ @@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc); static void PrintSnapshotLeakWarning(Snapshot snapshot); static void PrintFileLeakWarning(File file); static void PrintDSMLeakWarning(dsm_segment *seg); +static void PrintWESLeakWarning(WaitEventSet *events); /***************************************************************************** @@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name) ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL)); ResourceArrayInit(&(owner->filearr), FileGetDatum(-1)); ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL)); + ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL)); return owner; } @@ -538,6 +541,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner, PrintDSMLeakWarning(res); dsm_detach(res); } + + /* Ditto for wait event sets */ + while (ResourceArrayGetAny(&(owner->wesarr), &foundres)) + { + WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres); + + if (isCommit) + PrintWESLeakWarning(event); + FreeWaitEventSet(event); + } } else if (phase == RESOURCE_RELEASE_LOCKS) { @@ -685,6 +698,7 @@ ResourceOwnerDelete(ResourceOwner owner) Assert(owner->snapshotarr.nitems == 0); Assert(owner->filearr.nitems == 0); Assert(owner->dsmarr.nitems == 0); + Assert(owner->wesarr.nitems == 0); Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1); /* @@ -711,6 +725,7 @@ ResourceOwnerDelete(ResourceOwner owner) ResourceArrayFree(&(owner->snapshotarr)); ResourceArrayFree(&(owner->filearr)); ResourceArrayFree(&(owner->dsmarr)); + ResourceArrayFree(&(owner->wesarr)); pfree(owner); } @@ -1253,3 +1268,56 @@ PrintDSMLeakWarning(dsm_segment *seg) elog(WARNING, "dynamic shared memory leak: segment %u still referenced", dsm_segment_handle(seg)); } + +/* + * Make sure there is room for at least one more entry in a 
ResourceOwner's + * wait event set reference array. + * + * This is separate from actually inserting an entry because if we run out + * of memory, it's critical to do so *before* acquiring the resource. + */ +void +ResourceOwnerEnlargeWESs(ResourceOwner owner) +{ + ResourceArrayEnlarge(&(owner->wesarr)); +} + +/* + * Remember that a wait event set is owned by a ResourceOwner + * + * Caller must have previously done ResourceOwnerEnlargeWESs() + */ +void +ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events) +{ + ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events)); +} + +/* + * Forget that a wait event set is owned by a ResourceOwner + */ +void +ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events) +{ + /* + * XXXX: There's no property to show as an identifier of a wait event set, + * use its pointer instead. + */ + if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events))) + elog(ERROR, "wait event set %p is not owned by resource owner %s", + events, owner->name); +} + +/* + * Debugging subroutine + */ +static void +PrintWESLeakWarning(WaitEventSet *events) +{ + /* + * XXXX: There's no property to show as an identifier of a wait event set, + * use its pointer instead. + */ + elog(WARNING, "wait event set leak: %p still referenced", + events); +} diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h index a4bcb48..838845a 100644 --- a/src/include/storage/latch.h +++ b/src/include/storage/latch.h @@ -101,6 +101,7 @@ #define LATCH_H #include <signal.h> +#include "utils/resowner.h" /* * Latch structure should be treated as opaque and only accessed through @@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch); extern void SetLatch(volatile Latch *latch); extern void ResetLatch(volatile Latch *latch); -extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents); +extern WaitEventSet *CreateWaitEventSet(MemoryContext context, + ResourceOwner res, int nevents); extern void FreeWaitEventSet(WaitEventSet *set); extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch, void *user_data); diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h index 22b377c..56f2059 100644 --- a/src/include/utils/resowner_private.h +++ b/src/include/utils/resowner_private.h @@ -18,6 +18,7 @@ #include "storage/dsm.h" #include "storage/fd.h" +#include "storage/latch.h" #include "storage/lock.h" #include "utils/catcache.h" #include "utils/plancache.h" @@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner, extern void ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *); +/* support for wait event set management */ +extern void ResourceOwnerEnlargeWESs(ResourceOwner owner); +extern void ResourceOwnerRememberWES(ResourceOwner owner, + WaitEventSet *); +extern void ResourceOwnerForgetWES(ResourceOwner owner, + WaitEventSet *); + #endif /* RESOWNER_PRIVATE_H */ -- 2.9.2 From 60c663b3059e10302a71023eccb275da51331b39 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 19 Oct 2017 17:23:51 +0900 Subject: [PATCH 2/3] core side modification --- src/backend/executor/Makefile | 2 +- src/backend/executor/execAsync.c | 145 ++++++++++++++++++++ src/backend/executor/nodeAppend.c | 228 ++++++++++++++++++++++++++++---- src/backend/executor/nodeForeignscan.c | 22 ++- src/backend/optimizer/plan/createplan.c | 62 ++++++++- src/backend/postmaster/pgstat.c | 3 + src/backend/utils/adt/ruleutils.c | 8 +-
src/include/executor/execAsync.h | 23 ++++ src/include/executor/executor.h | 1 + src/include/executor/nodeForeignscan.h | 3 + src/include/foreign/fdwapi.h | 11 ++ src/include/nodes/execnodes.h | 18 ++- src/include/nodes/plannodes.h | 2 + src/include/pgstat.h | 3 +- 14 files changed, 489 insertions(+), 42 deletions(-) create mode 100644 src/backend/executor/execAsync.c create mode 100644 src/include/executor/execAsync.h diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile index cc09895..8ad2adf 100644 --- a/src/backend/executor/Makefile +++ b/src/backend/executor/Makefile @@ -12,7 +12,7 @@ subdir = src/backend/executor top_builddir = ../../.. include $(top_builddir)/src/Makefile.global -OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \ +OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \ execGrouping.o execIndexing.o execJunk.o \ execMain.o execParallel.o execPartition.o execProcnode.o \ execReplication.o execScan.o execSRF.o execTuples.o \ diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c new file mode 100644 index 0000000..db477e2 --- /dev/null +++ b/src/backend/executor/execAsync.c @@ -0,0 +1,145 @@ +/*------------------------------------------------------------------------- + * + * execAsync.c + * Support routines for asynchronous execution. + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/executor/execAsync.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "executor/execAsync.h" +#include "executor/nodeAppend.h" +#include "executor/nodeForeignscan.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "utils/memutils.h" +#include "utils/resowner.h" + +void ExecAsyncSetState(PlanState *pstate, AsyncState status) +{ + pstate->asyncstate = status; +} + +bool +ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node, + void *data, bool reinit) +{ + switch (nodeTag(node)) + { + case T_ForeignScanState: + return ExecForeignAsyncConfigureWait((ForeignScanState *)node, + wes, data, reinit); + break; + default: + elog(ERROR, "unrecognized node type: %d", + (int) nodeTag(node)); + } +} + +/* + * struct for memory context callback argument used in ExecAsyncEventWait + */ +typedef struct { + int **p_refind; + int *p_refindsize; +} ExecAsync_mcbarg; + +/* + * callback function to reset static variables pointing to the memory in + * TopTransactionContext in ExecAsyncEventWait. 
+ */ +static void ExecAsyncMemoryContextCallback(void *arg) +{ + /* arg is the address of the variable refind in ExecAsyncEventWait */ + ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg; + *mcbarg->p_refind = NULL; + *mcbarg->p_refindsize = 0; +} + +#define EVENT_BUFFER_SIZE 16 + +Bitmapset * +ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout) +{ + static int *refind = NULL; + static int refindsize = 0; + WaitEventSet *wes; + WaitEvent occurred_event[EVENT_BUFFER_SIZE]; + int noccurred = 0; + Bitmapset *fired_events = NULL; + int i; + int n; + + n = bms_num_members(waitnodes); + wes = CreateWaitEventSet(TopTransactionContext, + TopTransactionResourceOwner, n); + if (refindsize < n) + { + if (refindsize == 0) + refindsize = EVENT_BUFFER_SIZE; /* XXX */ + while (refindsize < n) + refindsize *= 2; + if (refind) + refind = (int *) repalloc(refind, refindsize * sizeof(int)); + else + { + static ExecAsync_mcbarg mcb_arg = + { &refind, &refindsize }; + static MemoryContextCallback mcb = + { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL }; + MemoryContext oldctxt = + MemoryContextSwitchTo(TopTransactionContext); + + /* + * refind points to a memory block in + * TopTransactionContext. Register a callback to reset it. + */ + MemoryContextRegisterResetCallback(TopTransactionContext, &mcb); + refind = (int *) palloc(refindsize * sizeof(int)); + MemoryContextSwitchTo(oldctxt); + } + } + + n = 0; + for (i = bms_next_member(waitnodes, -1) ; i >= 0 ; + i = bms_next_member(waitnodes, i)) + { + refind[i] = i; + if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true)) + n++; + } + + if (n == 0) + { + FreeWaitEventSet(wes); + return NULL; + } + + noccurred = WaitEventSetWait(wes, timeout, occurred_event, + EVENT_BUFFER_SIZE, + WAIT_EVENT_ASYNC_WAIT); + FreeWaitEventSet(wes); + if (noccurred == 0) + return NULL; + + for (i = 0 ; i < noccurred ; i++) + { + WaitEvent *w = &occurred_event[i]; + + if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0) + { + int n = *(int*)w->user_data; + + fired_events = bms_add_member(fired_events, n); + } + } + + return fired_events; +} diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c index 7a3dd2e..df1f7ae 100644 --- a/src/backend/executor/nodeAppend.c +++ b/src/backend/executor/nodeAppend.c @@ -59,6 +59,7 @@ #include "executor/execdebug.h" #include "executor/nodeAppend.h" +#include "executor/execAsync.h" #include "miscadmin.h" /* Shared state for parallel-aware Append. */ @@ -79,6 +80,7 @@ struct ParallelAppendState #define INVALID_SUBPLAN_INDEX -1 static TupleTableSlot *ExecAppend(PlanState *pstate); +static TupleTableSlot *ExecAppendAsync(PlanState *pstate); static bool choose_next_subplan_locally(AppendState *node); static bool choose_next_subplan_for_leader(AppendState *node); static bool choose_next_subplan_for_worker(AppendState *node); @@ -104,7 +106,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags) ListCell *lc; /* check for unsupported flags */ - Assert(!(eflags & EXEC_FLAG_MARK)); + Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC))); /* * Lock the non-leaf tables in the partition tree controlled by this node. 
@@ -127,6 +129,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags) appendstate->ps.ExecProcNode = ExecAppend; appendstate->appendplans = appendplanstates; appendstate->as_nplans = nplans; + appendstate->as_nasyncplans = node->nasyncplans; + appendstate->as_syncdone = (node->nasyncplans == nplans); + appendstate->as_asyncresult = (TupleTableSlot **) + palloc0(node->nasyncplans * sizeof(TupleTableSlot *)); + + /* Choose async version of Exec function */ + if (appendstate->as_nasyncplans > 0) + appendstate->ps.ExecProcNode = ExecAppendAsync; + + /* initially, all async plans need a request */ + for (i = 0; i < appendstate->as_nasyncplans; ++i) + appendstate->as_needrequest = + bms_add_member(appendstate->as_needrequest, i); /* * Initialize result tuple type and slot. @@ -141,11 +156,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags) foreach(lc, node->appendplans) { Plan *initNode = (Plan *) lfirst(lc); + int sub_eflags = eflags; - appendplanstates[i] = ExecInitNode(initNode, estate, eflags); + if (i < appendstate->as_nasyncplans) + sub_eflags |= EXEC_FLAG_ASYNC; + + appendplanstates[i] = ExecInitNode(initNode, estate, sub_eflags); i++; } + /* if there's any async-capable subnode, use async-aware routine */ + if (appendstate->as_nasyncplans) + appendstate->ps.ExecProcNode = ExecAppendAsync; + /* * Miscellaneous initialization * @@ -159,8 +182,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags) * looking at shared memory, but non-parallel-aware append plans can * always start with the first subplan. */ - appendstate->as_whichplan = - appendstate->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0; + if (appendstate->ps.plan->parallel_aware) + appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX; + else if (appendstate->as_nasyncplans > 0) + appendstate->as_whichsyncplan = appendstate->as_nasyncplans; + else + appendstate->as_whichsyncplan = 0; /* If parallel-aware, this will be overridden later. */ appendstate->choose_next_subplan = choose_next_subplan_locally; @@ -180,10 +207,12 @@ ExecAppend(PlanState *pstate) AppendState *node = castNode(AppendState, pstate); /* If no subplan has been chosen, we must choose one before proceeding.
*/ - if (node->as_whichplan == INVALID_SUBPLAN_INDEX && + if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX && !node->choose_next_subplan(node)) return ExecClearTuple(node->ps.ps_ResultTupleSlot); + Assert(node->as_nasyncplans == 0); + for (;;) { PlanState *subnode; @@ -194,8 +223,9 @@ ExecAppend(PlanState *pstate) /* * figure out which subplan we are currently processing */ - Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans); - subnode = node->appendplans[node->as_whichplan]; + Assert(node->as_whichsyncplan >= 0 && + node->as_whichsyncplan < node->as_nplans); + subnode = node->appendplans[node->as_whichsyncplan]; /* * get a tuple from the subplan @@ -218,6 +248,137 @@ ExecAppend(PlanState *pstate) } } +static TupleTableSlot * +ExecAppendAsync(PlanState *pstate) +{ + AppendState *node = castNode(AppendState, pstate); + Bitmapset *needrequest; + int i; + + Assert(node->as_nasyncplans > 0); + + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + needrequest = node->as_needrequest; + node->as_needrequest = NULL; + while ((i = bms_first_member(needrequest)) >= 0) + { + TupleTableSlot *slot; + PlanState *subnode = node->appendplans[i]; + + slot = ExecProcNode(subnode); + if (subnode->asyncstate == AS_AVAILABLE) + { + if (!TupIsNull(slot)) + { + node->as_asyncresult[node->as_nasyncresult++] = slot; + node->as_needrequest = bms_add_member(node->as_needrequest, i); + } + } + else + node->as_pending_async = bms_add_member(node->as_pending_async, i); + } + bms_free(needrequest); + + for (;;) + { + TupleTableSlot *result; + + /* return now if a result is available */ + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + while (!bms_is_empty(node->as_pending_async)) + { + long timeout = node->as_syncdone ? -1 : 0; + Bitmapset *fired; + int i; + + fired = ExecAsyncEventWait(node->appendplans, node->as_pending_async, + timeout); + while ((i = bms_first_member(fired)) >= 0) + { + TupleTableSlot *slot; + PlanState *subnode = node->appendplans[i]; + slot = ExecProcNode(subnode); + if (subnode->asyncstate == AS_AVAILABLE) + { + if (!TupIsNull(slot)) + { + node->as_asyncresult[node->as_nasyncresult++] = slot; + node->as_needrequest = + bms_add_member(node->as_needrequest, i); + } + node->as_pending_async = + bms_del_member(node->as_pending_async, i); + } + } + bms_free(fired); + + /* return now if a result is available */ + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + if (!node->as_syncdone) + break; + } + + /* + * If there is no asynchronous activity still pending and the + * synchronous activity is also complete, we're totally done scanning + * this node. Otherwise, we're done with the asynchronous stuff but + * must continue scanning the synchronous children. + */ + if (node->as_syncdone) + { + Assert(bms_is_empty(node->as_pending_async)); + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + + /* + * get a tuple from the subplan + */ + result = ExecProcNode(node->appendplans[node->as_whichsyncplan]); + + if (!TupIsNull(result)) + { + /* + * If the subplan gave us something then return it as-is. We do + * NOT make use of the result slot that was set up in + * ExecInitAppend; there's no need for it. + */ + return result; + } + + /* + * Go on to the "next" subplan in the appropriate direction. 
If no + * more subplans, return the empty slot set up for us by + * ExecInitAppend, unless there are async plans we have yet to finish. + */ + if (!node->choose_next_subplan(node)) + { + node->as_syncdone = true; + if (bms_is_empty(node->as_pending_async)) + { + Assert(bms_is_empty(node->as_needrequest)); + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + } + + /* Else loop back and try to get a tuple from the new subplan */ + } +} + /* ---------------------------------------------------------------- * ExecEndAppend * @@ -251,6 +412,15 @@ ExecReScanAppend(AppendState *node) { int i; + /* Reset async state. */ + for (i = 0; i < node->as_nasyncplans; ++i) + { + ExecShutdownNode(node->appendplans[i]); + node->as_needrequest = bms_add_member(node->as_needrequest, i); + } + node->as_nasyncresult = 0; + node->as_syncdone = (node->as_nasyncplans == node->as_nplans); + for (i = 0; i < node->as_nplans; i++) { PlanState *subnode = node->appendplans[i]; @@ -270,7 +440,7 @@ ExecReScanAppend(AppendState *node) ExecReScan(subnode); } - node->as_whichplan = + node->as_whichsyncplan = node->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0; } @@ -359,7 +529,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt) static bool choose_next_subplan_locally(AppendState *node) { - int whichplan = node->as_whichplan; + int whichplan = node->as_whichsyncplan; /* We should never see INVALID_SUBPLAN_INDEX in this case. */ Assert(whichplan >= 0 && whichplan <= node->as_nplans); @@ -368,13 +538,13 @@ choose_next_subplan_locally(AppendState *node) { if (whichplan >= node->as_nplans - 1) return false; - node->as_whichplan++; + node->as_whichsyncplan++; } else { if (whichplan <= 0) return false; - node->as_whichplan--; + node->as_whichsyncplan--; } return true; @@ -399,33 +569,33 @@ choose_next_subplan_for_leader(AppendState *node) LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE); - if (node->as_whichplan != INVALID_SUBPLAN_INDEX) + if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX) { /* Mark just-completed subplan as finished. */ - node->as_pstate->pa_finished[node->as_whichplan] = true; + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; } else { /* Start with last subplan. */ - node->as_whichplan = node->as_nplans - 1; + node->as_whichsyncplan = node->as_nplans - 1; } /* Loop until we find a subplan to execute. */ - while (pstate->pa_finished[node->as_whichplan]) + while (pstate->pa_finished[node->as_whichsyncplan]) { - if (node->as_whichplan == 0) + if (node->as_whichsyncplan == 0) { pstate->pa_next_plan = INVALID_SUBPLAN_INDEX; - node->as_whichplan = INVALID_SUBPLAN_INDEX; + node->as_whichsyncplan = INVALID_SUBPLAN_INDEX; LWLockRelease(&pstate->pa_lock); return false; } - node->as_whichplan--; + node->as_whichsyncplan--; } /* If non-partial, immediately mark as finished. */ - if (node->as_whichplan < append->first_partial_plan) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan < append->first_partial_plan) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; LWLockRelease(&pstate->pa_lock); @@ -457,8 +627,8 @@ choose_next_subplan_for_worker(AppendState *node) LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE); /* Mark just-completed subplan as finished. 
*/ - if (node->as_whichplan != INVALID_SUBPLAN_INDEX) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; /* If all the plans are already done, we have nothing to do */ if (pstate->pa_next_plan == INVALID_SUBPLAN_INDEX) @@ -468,7 +638,7 @@ choose_next_subplan_for_worker(AppendState *node) } /* Save the plan from which we are starting the search. */ - node->as_whichplan = pstate->pa_next_plan; + node->as_whichsyncplan = pstate->pa_next_plan; /* Loop until we find a subplan to execute. */ while (pstate->pa_finished[pstate->pa_next_plan]) @@ -478,7 +648,7 @@ choose_next_subplan_for_worker(AppendState *node) /* Advance to next plan. */ pstate->pa_next_plan++; } - else if (node->as_whichplan > append->first_partial_plan) + else if (node->as_whichsyncplan > append->first_partial_plan) { /* Loop back to first partial plan. */ pstate->pa_next_plan = append->first_partial_plan; @@ -489,10 +659,10 @@ choose_next_subplan_for_worker(AppendState *node) * At last plan, and either there are no partial plans or we've * tried them all. Arrange to bail out. */ - pstate->pa_next_plan = node->as_whichplan; + pstate->pa_next_plan = node->as_whichsyncplan; } - if (pstate->pa_next_plan == node->as_whichplan) + if (pstate->pa_next_plan == node->as_whichsyncplan) { /* We've tried everything! */ pstate->pa_next_plan = INVALID_SUBPLAN_INDEX; @@ -502,7 +672,7 @@ choose_next_subplan_for_worker(AppendState *node) } /* Pick the plan we found, and advance pa_next_plan one more time. */ - node->as_whichplan = pstate->pa_next_plan++; + node->as_whichsyncplan = pstate->pa_next_plan++; if (pstate->pa_next_plan >= node->as_nplans) { if (append->first_partial_plan < node->as_nplans) @@ -518,8 +688,8 @@ choose_next_subplan_for_worker(AppendState *node) } /* If non-partial, immediately mark as finished. 
*/ - if (node->as_whichplan < append->first_partial_plan) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan < append->first_partial_plan) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; LWLockRelease(&pstate->pa_lock); diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c index 0084234..7da1ac5 100644 --- a/src/backend/executor/nodeForeignscan.c +++ b/src/backend/executor/nodeForeignscan.c @@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate) (ExecScanRecheckMtd) ForeignRecheck); } - /* ---------------------------------------------------------------- * ExecInitForeignScan * ---------------------------------------------------------------- @@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags) scanstate->ss.ps.plan = (Plan *) node; scanstate->ss.ps.state = estate; scanstate->ss.ps.ExecProcNode = ExecForeignScan; + scanstate->ss.ps.asyncstate = AS_AVAILABLE; + + if ((eflags & EXEC_FLAG_ASYNC) != 0) + scanstate->fs_async = true; /* * Miscellaneous initialization @@ -383,3 +386,20 @@ ExecShutdownForeignScan(ForeignScanState *node) if (fdwroutine->ShutdownForeignScan) fdwroutine->ShutdownForeignScan(node); } + +/* ---------------------------------------------------------------- + * ExecForeignAsyncConfigureWait + * + * In async mode, configure for a wait + * ---------------------------------------------------------------- + */ +bool +ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes, + void *caller_data, bool reinit) +{ + FdwRoutine *fdwroutine = node->fdwroutine; + + Assert(fdwroutine->ForeignAsyncConfigureWait != NULL); + return fdwroutine->ForeignAsyncConfigureWait(node, wes, + caller_data, reinit); +} diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index da0cc7f..24c838d 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -204,7 +204,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual, Index scanrelid, int wtParam); static Append *make_append(List *appendplans, int first_partial_plan, - List *tlist, List *partitioned_rels); + int nasyncplans, int referent, + List *tlist, List *partitioned_rels); static RecursiveUnion *make_recursive_union(List *tlist, Plan *lefttree, Plan *righttree, @@ -287,6 +288,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root, List *rowMarks, OnConflictExpr *onconflict, int epqParam); static GatherMerge *create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path); +static bool is_async_capable_path(Path *path); /* @@ -1020,8 +1022,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path) { Append *plan; List *tlist = build_path_tlist(root, &best_path->path); - List *subplans = NIL; + List *asyncplans = NIL; + List *syncplans = NIL; ListCell *subpaths; + int nasyncplans = 0; + bool first = true; + bool referent_is_sync = true; /* * The subpaths list could be empty, if every child was proven empty by @@ -1056,7 +1062,21 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path) /* Must insist that all children return the same tlist */ subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST); - subplans = lappend(subplans, subplan); + /* + * Classify as async-capable or not. If we have decided to run the + * children in parallel, we cannot run any of them asynchronously.
+ */ + if (!best_path->path.parallel_safe && is_async_capable_path(subpath)) + { + asyncplans = lappend(asyncplans, subplan); + ++nasyncplans; + if (first) + referent_is_sync = false; + } + else + syncplans = lappend(syncplans, subplan); + + first = false; } /* @@ -1066,8 +1086,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path) * parent-rel Vars it'll be asked to emit. */ - plan = make_append(subplans, best_path->first_partial_path, - tlist, best_path->partitioned_rels); + plan = make_append(list_concat(asyncplans, syncplans), + best_path->first_partial_path, nasyncplans, + referent_is_sync ? nasyncplans : 0, tlist, + best_path->partitioned_rels); copy_generic_path_info(&plan->plan, (Path *) best_path); @@ -5319,8 +5341,8 @@ make_foreignscan(List *qptlist, } static Append * -make_append(List *appendplans, int first_partial_plan, - List *tlist, List *partitioned_rels) +make_append(List *appendplans, int first_partial_plan, int nasyncplans, + int referent, List *tlist, List *partitioned_rels) { Append *node = makeNode(Append); Plan *plan = &node->plan; @@ -5332,6 +5354,8 @@ make_append(List *appendplans, int first_partial_plan, node->partitioned_rels = partitioned_rels; node->appendplans = appendplans; node->first_partial_plan = first_partial_plan; + node->nasyncplans = nasyncplans; + node->referent = referent; return node; } @@ -6677,3 +6701,27 @@ is_projection_capable_plan(Plan *plan) } return true; } + +/* + * is_async_capable_path + * Check whether a given Path node is async-capable. + */ +static bool +is_async_capable_path(Path *path) +{ + switch (nodeTag(path)) + { + case T_ForeignPath: + { + FdwRoutine *fdwroutine = path->parent->fdwroutine; + + Assert(fdwroutine != NULL); + if (fdwroutine->IsForeignPathAsyncCapable != NULL && + fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path)) + return true; + } + default: + break; + } + return false; +} diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 96ba216..08eac23 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -3676,6 +3676,9 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_SYNC_REP: event_name = "SyncRep"; break; + case WAIT_EVENT_ASYNC_WAIT: + event_name = "AsyncExecWait"; + break; /* no default case, so that compiler will warn */ } diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index ba9fab4..6837642 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -4463,7 +4463,7 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps) dpns->planstate = ps; /* - * We special-case Append and MergeAppend to pretend that the first child + * We special-case Append and MergeAppend to pretend that a specific child * plan is the OUTER referent; we have to interpret OUTER Vars in their * tlists according to one of the children, and the first one is the most * natural choice. Likewise special-case ModifyTable to pretend that the * first target relation is the OUTER referent; this is to support RETURNING * lists containing references to non-target relations.
*/ if (IsA(ps, AppendState)) - dpns->outer_planstate = ((AppendState *) ps)->appendplans[0]; + { + AppendState *aps = (AppendState *) ps; + Append *app = (Append *) ps->plan; + dpns->outer_planstate = aps->appendplans[app->referent]; + } else if (IsA(ps, MergeAppendState)) dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0]; else if (IsA(ps, ModifyTableState)) diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h new file mode 100644 index 0000000..5fd67d9 --- /dev/null +++ b/src/include/executor/execAsync.h @@ -0,0 +1,23 @@ +/*-------------------------------------------------------------------- + * execAsync.h + * Support functions for asynchronous query execution + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/include/executor/execAsync.h +*-------------------------------------------------------------------- + */ +#ifndef EXECASYNC_H +#define EXECASYNC_H + +#include "nodes/execnodes.h" +#include "storage/latch.h" + +extern void ExecAsyncSetState(PlanState *pstate, AsyncState status); +extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node, + void *data, bool reinit); +extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, + long timeout); +#endif /* EXECASYNC_H */ diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index 45a077a..54cc358 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -64,6 +64,7 @@ #define EXEC_FLAG_WITH_OIDS 0x0020 /* force OIDs in returned tuples */ #define EXEC_FLAG_WITHOUT_OIDS 0x0040 /* force no OIDs in returned tuples */ #define EXEC_FLAG_WITH_NO_DATA 0x0080 /* rel scannability doesn't matter */ +#define EXEC_FLAG_ASYNC 0x0100 /* request async execution */ /* Hook for plugins to get control in ExecutorStart() */ diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h index ccb66be..67abf8e 100644 --- a/src/include/executor/nodeForeignscan.h +++ b/src/include/executor/nodeForeignscan.h @@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node, extern void ExecForeignScanInitializeWorker(ForeignScanState *node, ParallelWorkerContext *pwcxt); extern void ExecShutdownForeignScan(ForeignScanState *node); +extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, bool reinit); #endif /* NODEFOREIGNSCAN_H */ diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h index e88fee3..beb3f0d 100644 --- a/src/include/foreign/fdwapi.h +++ b/src/include/foreign/fdwapi.h @@ -161,6 +161,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root, typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root, List *fdw_private, RelOptInfo *child_rel); +typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path); +typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, + bool reinit); /* * FdwRoutine is the struct returned by a foreign-data wrapper's handler @@ -182,6 +187,7 @@ typedef struct FdwRoutine GetForeignPlan_function GetForeignPlan; BeginForeignScan_function BeginForeignScan; IterateForeignScan_function IterateForeignScan; + IterateForeignScan_function IterateForeignScanAsync; ReScanForeignScan_function ReScanForeignScan; EndForeignScan_function EndForeignScan; @@ -232,6
+238,11 @@ typedef struct FdwRoutine InitializeDSMForeignScan_function InitializeDSMForeignScan; ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan; InitializeWorkerForeignScan_function InitializeWorkerForeignScan; + + /* Support functions for asynchronous execution */ + IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable; + ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait; + ShutdownForeignScan_function ShutdownForeignScan; /* Support functions for path reparameterization. */ diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index a953820..c9c3db2 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -861,6 +861,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate); * abstract superclass for all PlanState-type nodes. * ---------------- */ +typedef enum AsyncState +{ + AS_AVAILABLE, + AS_WAITING +} AsyncState; + typedef struct PlanState { NodeTag type; @@ -901,6 +907,9 @@ typedef struct PlanState TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */ ExprContext *ps_ExprContext; /* node's expression-evaluation context */ ProjectionInfo *ps_ProjInfo; /* info for doing tuple projection */ + + AsyncState asyncstate; + int32 padding; /* to keep alignment of derived types */ } PlanState; /* ---------------- @@ -1023,10 +1032,16 @@ struct AppendState PlanState ps; /* its first field is NodeTag */ PlanState **appendplans; /* array of PlanStates for my inputs */ int as_nplans; - int as_whichplan; + int as_nasyncplans; /* # of async-capable children */ ParallelAppendState *as_pstate; /* parallel coordination info */ + int as_whichsyncplan; /* which sync plan is being executed */ Size pstate_len; /* size of parallel coordination info */ bool (*choose_next_subplan) (AppendState *); + bool as_syncdone; /* all synchronous plans done? 
*/ + Bitmapset *as_needrequest; /* async plans needing a new request */ + Bitmapset *as_pending_async; /* pending async plans */ + TupleTableSlot **as_asyncresult; /* unreturned results of async plans */ + int as_nasyncresult; /* # of valid entries in as_asyncresult */ }; /* ---------------- @@ -1577,6 +1592,7 @@ typedef struct ForeignScanState Size pscan_len; /* size of parallel coordination information */ /* use struct pointer to avoid including fdwapi.h here */ struct FdwRoutine *fdwroutine; + bool fs_async; void *fdw_state; /* foreign-data wrapper can keep state here */ } ForeignScanState; diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h index f2e19ea..64ee18e 100644 --- a/src/include/nodes/plannodes.h +++ b/src/include/nodes/plannodes.h @@ -250,6 +250,8 @@ typedef struct Append List *partitioned_rels; List *appendplans; int first_partial_plan; + int nasyncplans; /* # of async plans, always at start of list */ + int referent; /* index of inheritance tree referent */ } Append; /* ---------------- diff --git a/src/include/pgstat.h b/src/include/pgstat.h index be2f592..6f4583b 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -832,7 +832,8 @@ typedef enum WAIT_EVENT_REPLICATION_ORIGIN_DROP, WAIT_EVENT_REPLICATION_SLOT_DROP, WAIT_EVENT_SAFE_SNAPSHOT, - WAIT_EVENT_SYNC_REP + WAIT_EVENT_SYNC_REP, + WAIT_EVENT_ASYNC_WAIT } WaitEventIPC; /* ---------- -- 2.9.2 From c2195953a34fe7c0574631e5c118a948263dc755 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 19 Oct 2017 17:24:07 +0900 Subject: [PATCH 3/3] async postgres_fdw --- contrib/postgres_fdw/connection.c | 26 ++ contrib/postgres_fdw/expected/postgres_fdw.out | 128 ++++--- contrib/postgres_fdw/postgres_fdw.c | 484 +++++++++++++++++++++---- contrib/postgres_fdw/postgres_fdw.h | 2 + contrib/postgres_fdw/sql/postgres_fdw.sql | 20 +- 5 files changed, 522 insertions(+), 138 deletions(-) diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index 00c926b..4f3d59d 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -58,6 +58,7 @@ typedef struct ConnCacheEntry bool invalidated; /* true if reconnect is pending */ uint32 server_hashvalue; /* hash value of foreign server OID */ uint32 mapping_hashvalue; /* hash value of user mapping OID */ + void *storage; /* connection specific storage */ } ConnCacheEntry; /* @@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt) elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)", entry->conn, server->servername, user->umid, user->userid); + entry->storage = NULL; } /* @@ -216,6 +218,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt) } /* + * Returns the connection-specific storage for this user. Allocate it with + * initsize if it does not exist yet. + */ +void * +GetConnectionSpecificStorage(UserMapping *user, size_t initsize) +{ + bool found; + ConnCacheEntry *entry; + ConnCacheKey key; + + key = user->umid; + entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found); + Assert(found); + + if (entry->storage == NULL) + { + entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize); + memset(entry->storage, 0, initsize); + } + + return entry->storage; +} + +/* + * Connect to remote server using specified server and user mapping properties.
*/ static PGconn * diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 262c635..29ba813 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -6790,7 +6790,7 @@ INSERT INTO a(aa) VALUES('aaaaa'); INSERT INTO b(aa) VALUES('bbb'); INSERT INTO b(aa) VALUES('bbbb'); INSERT INTO b(aa) VALUES('bbbbb'); -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+------- a | aaa @@ -6818,7 +6818,7 @@ SELECT tableoid::regclass, * FROM ONLY a; (3 rows) UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | aaa @@ -6846,7 +6846,7 @@ SELECT tableoid::regclass, * FROM ONLY a; (3 rows) UPDATE b SET aa = 'new'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | aaa @@ -6874,7 +6874,7 @@ SELECT tableoid::regclass, * FROM ONLY a; (3 rows) UPDATE a SET aa = 'newtoo'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | newtoo @@ -6940,35 +6940,40 @@ insert into bar2 values(3,33,33); insert into bar2 values(4,44,44); insert into bar2 values(7,77,77); explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for update; - QUERY PLAN ----------------------------------------------------------------------------------------------- +select * from bar where f1 in (select f1 from foo) order by 1 for update; + QUERY PLAN +----------------------------------------------------------------------------------------------------------------- LockRows Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid - -> Hash Join + -> Merge Join Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid Inner Unique: true - Hash Cond: (bar.f1 = foo.f1) - -> Append - -> Seq Scan on public.bar + Merge Cond: (bar.f1 = foo.f1) + -> Merge Append + Sort Key: bar.f1 + -> Sort Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid + Sort Key: bar.f1 + -> Seq Scan on public.bar + Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid -> Foreign Scan on public.bar2 Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid - Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE - -> Hash + Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE + -> Sort Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Sort Key: foo.f1 -> HashAggregate Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -(23 rows) + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 +(28 rows) -select * from bar where f1 in (select f1 from foo) for update; +select * from bar where f1 in (select f1 from foo) order by 1 for update; f1 | f2 ----+---- 1 | 11 @@ -6978,35 +6983,40 @@ select * from bar where f1 in (select f1 from foo) for update; (4 rows) explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for share; - QUERY PLAN ----------------------------------------------------------------------------------------------- +select * from 
bar where f1 in (select f1 from foo) order by 1 for share; + QUERY PLAN +---------------------------------------------------------------------------------------------------------------- LockRows Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid - -> Hash Join + -> Merge Join Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid Inner Unique: true - Hash Cond: (bar.f1 = foo.f1) - -> Append - -> Seq Scan on public.bar + Merge Cond: (bar.f1 = foo.f1) + -> Merge Append + Sort Key: bar.f1 + -> Sort Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid + Sort Key: bar.f1 + -> Seq Scan on public.bar + Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid -> Foreign Scan on public.bar2 Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid - Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE - -> Hash + Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE + -> Sort Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Sort Key: foo.f1 -> HashAggregate Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -(23 rows) + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 +(28 rows) -select * from bar where f1 in (select f1 from foo) for share; +select * from bar where f1 in (select f1 from foo) order by 1 for share; f1 | f2 ----+---- 1 | 11 @@ -7036,11 +7046,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Hash Join Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid Inner Unique: true @@ -7054,11 +7064,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 (39 rows) update bar set f2 = f2 + 100 where f1 in (select f1 from foo); @@ -7089,16 +7099,16 @@ where bar.f1 = ss.f1; Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1)) Hash Cond: (foo.f1 = bar.f1) -> Append - -> Seq Scan on public.foo - Output: ROW(foo.f1), foo.f1 -> Foreign Scan on public.foo2 Output: ROW(foo2.f1), foo2.f1 Remote SQL: SELECT f1 FROM public.loct1 - -> Seq Scan on public.foo foo_1 - Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) -> Foreign Scan on public.foo2 foo2_1 Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) Remote SQL: SELECT f1 FROM public.loct1 + -> Seq Scan on public.foo + Output: ROW(foo.f1), foo.f1 + -> Seq Scan on public.foo foo_1 + Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) -> Hash Output: bar.f1, bar.f2, bar.ctid -> Seq Scan on public.bar @@ -7116,16 +7126,16 @@ where bar.f1 = ss.f1; Output: (ROW(foo.f1)), foo.f1 Sort Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: 
ROW(foo.f1), foo.f1 -> Foreign Scan on public.foo2 Output: ROW(foo2.f1), foo2.f1 Remote SQL: SELECT f1 FROM public.loct1 - -> Seq Scan on public.foo foo_1 - Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) -> Foreign Scan on public.foo2 foo2_1 Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) Remote SQL: SELECT f1 FROM public.loct1 + -> Seq Scan on public.foo + Output: ROW(foo.f1), foo.f1 + -> Seq Scan on public.foo foo_1 + Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) (45 rows) update bar set f2 = f2 + 100 @@ -7276,27 +7286,33 @@ delete from foo where f1 < 5 returning *; (5 rows) explain (verbose, costs off) -update bar set f2 = f2 + 100 returning *; - QUERY PLAN ------------------------------------------------------------------------------- - Update on public.bar - Output: bar.f1, bar.f2 - Update on public.bar - Foreign Update on public.bar2 - -> Seq Scan on public.bar - Output: bar.f1, (bar.f2 + 100), bar.ctid - -> Foreign Update on public.bar2 - Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2 -(8 rows) +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; + QUERY PLAN +-------------------------------------------------------------------------------------- + Sort + Output: u.f1, u.f2 + Sort Key: u.f1 + CTE u + -> Update on public.bar + Output: bar.f1, bar.f2 + Update on public.bar + Foreign Update on public.bar2 + -> Seq Scan on public.bar + Output: bar.f1, (bar.f2 + 100), bar.ctid + -> Foreign Update on public.bar2 + Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2 + -> CTE Scan on u + Output: u.f1, u.f2 +(14 rows) -update bar set f2 = f2 + 100 returning *; +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; f1 | f2 ----+----- 1 | 311 2 | 322 - 6 | 266 3 | 333 4 | 344 + 6 | 266 7 | 277 (6 rows) diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 941a2e7..337c728 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -20,6 +20,8 @@ #include "commands/defrem.h" #include "commands/explain.h" #include "commands/vacuum.h" +#include "executor/execAsync.h" +#include "executor/nodeForeignscan.h" #include "foreign/fdwapi.h" #include "funcapi.h" #include "miscadmin.h" @@ -34,6 +36,7 @@ #include "optimizer/var.h" #include "optimizer/tlist.h" #include "parser/parsetree.h" +#include "pgstat.h" #include "utils/builtins.h" #include "utils/guc.h" #include "utils/lsyscache.h" @@ -53,6 +56,9 @@ PG_MODULE_MAGIC; /* If no remote estimates, assume a sort costs 20% extra */ #define DEFAULT_FDW_SORT_MULTIPLIER 1.2 +/* Retrive PgFdwScanState struct from ForeginScanState */ +#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state) + /* * Indexes of FDW-private information stored in fdw_private lists. * @@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex }; /* + * Connection private area structure. + */ +typedef struct PgFdwConnpriv +{ + ForeignScanState *current_owner; /* The node currently running a query + * on this connection*/ +} PgFdwConnpriv; + +/* Execution state base type */ +typedef struct PgFdwState +{ + PGconn *conn; /* connection for the scan */ + PgFdwConnpriv *connpriv; /* connection private memory */ +} PgFdwState; + +/* * Execution state of a foreign scan using postgres_fdw. */ typedef struct PgFdwScanState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table. NULL * for a foreign join scan. 
*/ TupleDesc tupdesc; /* tuple descriptor of scan */ @@ -134,7 +157,7 @@ typedef struct PgFdwScanState List *retrieved_attrs; /* list of retrieved attribute numbers */ /* for remote query execution */ - PGconn *conn; /* connection for the scan */ + bool result_ready; unsigned int cursor_number; /* quasi-unique ID for my cursor */ bool cursor_exists; /* have we created the cursor? */ int numParams; /* number of parameters passed to query */ @@ -150,6 +173,13 @@ typedef struct PgFdwScanState /* batch-level state, for optimizing rewinds and avoiding useless fetch */ int fetch_ct_2; /* Min(# of fetches done, 2) */ bool eof_reached; /* true if last fetch reached EOF */ + bool run_async; /* true if run asynchronously */ + bool async_waiting; /* true if requesting the parent to wait */ + ForeignScanState *waiter; /* Next node to run a query among nodes + * sharing the same connection */ + ForeignScanState *last_waiter; /* A waiting node at the end of a waiting + * list. Maintained only by the current + * owner of the connection */ /* working memory contexts */ MemoryContext batch_cxt; /* context holding current batch of tuples */ @@ -163,11 +193,11 @@ typedef struct PgFdwScanState */ typedef struct PgFdwModifyState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table */ AttInMetadata *attinmeta; /* attribute datatype conversion metadata */ /* for remote query execution */ - PGconn *conn; /* connection for the scan */ char *p_name; /* name of prepared statement, if created */ /* extracted fdw_private data */ @@ -190,6 +220,7 @@ typedef struct PgFdwModifyState */ typedef struct PgFdwDirectModifyState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table */ AttInMetadata *attinmeta; /* attribute datatype conversion metadata */ @@ -293,6 +324,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags); static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node); static void postgresReScanForeignScan(ForeignScanState *node); static void postgresEndForeignScan(ForeignScanState *node); +static void postgresShutdownForeignScan(ForeignScanState *node); static void postgresAddForeignUpdateTargets(Query *parsetree, RangeTblEntry *target_rte, Relation target_relation); @@ -353,6 +385,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root, UpperRelationKind stage, RelOptInfo *input_rel, RelOptInfo *output_rel); +static bool postgresIsForeignPathAsyncCapable(ForeignPath *path); +static bool postgresForeignAsyncConfigureWait(ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, bool reinit); /* * Helper functions @@ -373,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel, EquivalenceClass *ec, EquivalenceMember *em, void *arg); static void create_cursor(ForeignScanState *node); -static void fetch_more_data(ForeignScanState *node); +static void request_more_data(ForeignScanState *node); +static void fetch_received_data(ForeignScanState *node); +static void vacate_connection(PgFdwState *fdwconn); +static void absorb_current_result(ForeignScanState *node); static void close_cursor(PGconn *conn, unsigned int cursor_number); static void prepare_foreign_modify(PgFdwModifyState *fmstate); static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate, @@ -452,6 +491,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS) routine->IterateForeignScan = postgresIterateForeignScan; routine->ReScanForeignScan = postgresReScanForeignScan; 
routine->EndForeignScan = postgresEndForeignScan; + routine->ShutdownForeignScan = postgresShutdownForeignScan; /* Functions for updating foreign tables */ routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets; @@ -486,6 +526,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS) /* Support functions for upper relation push-down */ routine->GetForeignUpperPaths = postgresGetForeignUpperPaths; + /* Support functions for async execution */ + routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable; + routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait; + PG_RETURN_POINTER(routine); } @@ -1336,12 +1380,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags) * Get connection to the foreign server. Connection manager will * establish new connection if necessary. */ - fsstate->conn = GetConnection(user, false); + fsstate->s.conn = GetConnection(user, false); + fsstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); + fsstate->s.connpriv->current_owner = NULL; + fsstate->waiter = NULL; + fsstate->last_waiter = node; /* Assign a unique ID for my cursor */ - fsstate->cursor_number = GetCursorNumber(fsstate->conn); + fsstate->cursor_number = GetCursorNumber(fsstate->s.conn); fsstate->cursor_exists = false; + /* Initialize async execution status */ + fsstate->run_async = false; + fsstate->async_waiting = false; + /* Get private info created by planner functions. */ fsstate->query = strVal(list_nth(fsplan->fdw_private, FdwScanPrivateSelectSql)); @@ -1397,32 +1450,136 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags) static TupleTableSlot * postgresIterateForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); TupleTableSlot *slot = node->ss.ss_ScanTupleSlot; /* - * If this is the first call after Begin or ReScan, we need to create the - * cursor on the remote side. - */ - if (!fsstate->cursor_exists) - create_cursor(node); - - /* * Get some more tuples, if we've run out. */ if (fsstate->next_tuple >= fsstate->num_tuples) { - /* No point in another fetch if we already detected EOF, though. */ - if (!fsstate->eof_reached) - fetch_more_data(node); - /* If we didn't get any tuples, must be end of data. */ - if (fsstate->next_tuple >= fsstate->num_tuples) + ForeignScanState *next_conn_owner = node; + + /* This node has sent a query on this connection */ + if (fsstate->s.connpriv->current_owner == node) + { + /* Check if the result is available */ + if (PQisBusy(fsstate->s.conn)) + { + int rc = WaitLatchOrSocket(NULL, + WL_SOCKET_READABLE | WL_TIMEOUT, + PQsocket(fsstate->s.conn), 0, + WAIT_EVENT_ASYNC_WAIT); + if (node->fs_async && !(rc & WL_SOCKET_READABLE)) + { + /* + * This node is not ready yet. Tell the caller to wait. + */ + fsstate->result_ready = false; + node->ss.ps.asyncstate = AS_WAITING; + return ExecClearTuple(slot); + } + } + + Assert(fsstate->async_waiting); + fsstate->async_waiting = false; + fetch_received_data(node); + + /* + * If someone is waiting this node on the same connection, let the + * first waiter be the next owner of this connection. 
+ */ + if (fsstate->waiter) + { + PgFdwScanState *next_owner_state; + + next_conn_owner = fsstate->waiter; + next_owner_state = GetPgFdwScanState(next_conn_owner); + fsstate->waiter = NULL; + + /* + * only the current owner is responsible to maintain the shortcut + * to the last waiter + */ + next_owner_state->last_waiter = fsstate->last_waiter; + + /* + * for simplicity, last_waiter points itself on a node that no one + * is waiting for. + */ + fsstate->last_waiter = node; + } + } + else if (fsstate->s.connpriv->current_owner && + !GetPgFdwScanState(node)->eof_reached) + { + /* + * Anyone else is holding this connection and we want this node to + * run later. Add myself to the tail of the waiters' list then + * return not-ready. To avoid scanning through the waiters' list, + * the current owner is to maintain the shortcut to the last + * waiter. + */ + PgFdwScanState *conn_owner_state = + GetPgFdwScanState(fsstate->s.connpriv->current_owner); + ForeignScanState *last_waiter = conn_owner_state->last_waiter; + PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter); + + last_waiter_state->waiter = node; + conn_owner_state->last_waiter = node; + + /* Register the node to the async-waiting node list */ + Assert(!GetPgFdwScanState(node)->async_waiting); + + GetPgFdwScanState(node)->async_waiting = true; + + fsstate->result_ready = fsstate->eof_reached; + node->ss.ps.asyncstate = + fsstate->result_ready ? AS_AVAILABLE : AS_WAITING; return ExecClearTuple(slot); + } + + /* + * Send the next request for the next owner of this connection if + * needed. + */ + if (!GetPgFdwScanState(next_conn_owner)->eof_reached) + { + PgFdwScanState *next_owner_state = + GetPgFdwScanState(next_conn_owner); + + /* No one is running on this connection at this time */ + Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner + == NULL); + request_more_data(next_conn_owner); + + /* Register the node to the async-waiting node list */ + if (!next_owner_state->async_waiting) + next_owner_state->async_waiting = true; + + if (!next_conn_owner->fs_async) + fetch_received_data(next_conn_owner); + } + + + /* + * If we haven't received a result for the given node this time, + * return with no tuple to give way to other nodes. + */ + if (fsstate->next_tuple >= fsstate->num_tuples) + { + fsstate->result_ready = fsstate->eof_reached; + node->ss.ps.asyncstate = + fsstate->result_ready ? AS_AVAILABLE : AS_WAITING; + return ExecClearTuple(slot); + } } /* * Return the next tuple. */ + fsstate->result_ready = true; + node->ss.ps.asyncstate = AS_AVAILABLE; ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++], slot, InvalidBuffer, @@ -1438,7 +1595,7 @@ postgresIterateForeignScan(ForeignScanState *node) static void postgresReScanForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); char sql[64]; PGresult *res; @@ -1446,6 +1603,9 @@ postgresReScanForeignScan(ForeignScanState *node) if (!fsstate->cursor_exists) return; + /* Absorb the ramining result */ + absorb_current_result(node); + /* * If any internal parameters affecting this node have changed, we'd * better destroy and recreate the cursor. Otherwise, rewinding it should @@ -1474,9 +1634,9 @@ postgresReScanForeignScan(ForeignScanState *node) * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. 
*/ - res = pgfdw_exec_query(fsstate->conn, sql); + res = pgfdw_exec_query(fsstate->s.conn, sql); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fsstate->conn, true, sql); + pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql); PQclear(res); /* Now force a fresh FETCH. */ @@ -1494,7 +1654,7 @@ postgresReScanForeignScan(ForeignScanState *node) static void postgresEndForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); /* if fsstate is NULL, we are in EXPLAIN; nothing to do */ if (fsstate == NULL) @@ -1502,16 +1662,32 @@ postgresEndForeignScan(ForeignScanState *node) /* Close the cursor if open, to prevent accumulation of cursors */ if (fsstate->cursor_exists) - close_cursor(fsstate->conn, fsstate->cursor_number); + close_cursor(fsstate->s.conn, fsstate->cursor_number); /* Release remote connection */ - ReleaseConnection(fsstate->conn); - fsstate->conn = NULL; + ReleaseConnection(fsstate->s.conn); + fsstate->s.conn = NULL; /* MemoryContexts will be deleted automatically. */ } /* + * postgresShutdownForeignScan + * Remove asynchrony stuff and cleanup garbage on the connection. + */ +static void +postgresShutdownForeignScan(ForeignScanState *node) +{ + ForeignScan *plan = (ForeignScan *) node->ss.ps.plan; + + if (plan->operation != CMD_SELECT) + return; + + /* Absorb the ramining result */ + absorb_current_result(node); +} + +/* * postgresAddForeignUpdateTargets * Add resjunk column(s) needed for update/delete on a foreign table */ @@ -1714,7 +1890,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate, user = GetUserMapping(userid, table->serverid); /* Open connection; report that we'll create a prepared statement. */ - fmstate->conn = GetConnection(user, true); + fmstate->s.conn = GetConnection(user, true); + fmstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); fmstate->p_name = NULL; /* prepared statement not made yet */ /* Deconstruct fdw_private data. */ @@ -1793,6 +1971,8 @@ postgresExecForeignInsert(EState *estate, PGresult *res; int n_rows; + vacate_connection((PgFdwState *)fmstate); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1803,14 +1983,14 @@ postgresExecForeignInsert(EState *estate, /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -1818,10 +1998,10 @@ postgresExecForeignInsert(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ? 
PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -1859,6 +2039,8 @@ postgresExecForeignUpdate(EState *estate, PGresult *res; int n_rows; + vacate_connection((PgFdwState *)fmstate); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1879,14 +2061,14 @@ postgresExecForeignUpdate(EState *estate, /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -1894,10 +2076,10 @@ postgresExecForeignUpdate(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -1935,6 +2117,8 @@ postgresExecForeignDelete(EState *estate, PGresult *res; int n_rows; + vacate_connection((PgFdwState *)fmstate); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1955,14 +2139,14 @@ postgresExecForeignDelete(EState *estate, /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -1970,10 +2154,10 @@ postgresExecForeignDelete(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -2020,16 +2204,16 @@ postgresEndForeignModify(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. 
*/ - res = pgfdw_exec_query(fmstate->conn, sql); + res = pgfdw_exec_query(fmstate->s.conn, sql); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fmstate->conn, true, sql); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql); PQclear(res); fmstate->p_name = NULL; } /* Release remote connection */ - ReleaseConnection(fmstate->conn); - fmstate->conn = NULL; + ReleaseConnection(fmstate->s.conn); + fmstate->s.conn = NULL; } /* @@ -2353,7 +2537,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags) * Get connection to the foreign server. Connection manager will * establish new connection if necessary. */ - dmstate->conn = GetConnection(user, false); + dmstate->s.conn = GetConnection(user, false); + dmstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); /* Update the foreign-join-related fields. */ if (fsplan->scan.scanrelid == 0) @@ -2438,7 +2624,10 @@ postgresIterateDirectModify(ForeignScanState *node) * If this is the first call after Begin, execute the statement. */ if (dmstate->num_tuples == -1) + { + vacate_connection((PgFdwState *)dmstate); execute_dml_stmt(node); + } /* * If the local query doesn't specify RETURNING, just clear tuple slot. @@ -2485,8 +2674,8 @@ postgresEndDirectModify(ForeignScanState *node) PQclear(dmstate->result); /* Release remote connection */ - ReleaseConnection(dmstate->conn); - dmstate->conn = NULL; + ReleaseConnection(dmstate->s.conn); + dmstate->s.conn = NULL; /* close the target relation. */ if (dmstate->resultRel) @@ -2609,6 +2798,7 @@ estimate_path_cost_size(PlannerInfo *root, List *local_param_join_conds; StringInfoData sql; PGconn *conn; + PgFdwConnpriv *connpriv; Selectivity local_sel; QualCost local_cost; List *fdw_scan_tlist = NIL; @@ -2651,6 +2841,16 @@ estimate_path_cost_size(PlannerInfo *root, /* Get the remote estimate */ conn = GetConnection(fpinfo->user, false); + connpriv = GetConnectionSpecificStorage(fpinfo->user, + sizeof(PgFdwConnpriv)); + if (connpriv) + { + PgFdwState tmpstate; + tmpstate.conn = conn; + tmpstate.connpriv = connpriv; + vacate_connection(&tmpstate); + } + get_remote_estimate(sql.data, conn, &rows, &width, &startup_cost, &total_cost); ReleaseConnection(conn); @@ -3005,11 +3205,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel, static void create_cursor(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); ExprContext *econtext = node->ss.ps.ps_ExprContext; int numParams = fsstate->numParams; const char **values = fsstate->param_values; - PGconn *conn = fsstate->conn; + PGconn *conn = fsstate->s.conn; StringInfoData buf; PGresult *res; @@ -3075,47 +3275,96 @@ create_cursor(ForeignScanState *node) * Fetch some more rows from the node's cursor. */ static void -fetch_more_data(ForeignScanState *node) +request_more_data(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); + PGconn *conn = fsstate->s.conn; + char sql[64]; + + /* The connection should be vacant */ + Assert(fsstate->s.connpriv->current_owner == NULL); + + /* + * If this is the first call after Begin or ReScan, we need to create the + * cursor on the remote side. 
+ */ + if (!fsstate->cursor_exists) + create_cursor(node); + + snprintf(sql, sizeof(sql), "FETCH %d FROM c%u", + fsstate->fetch_size, fsstate->cursor_number); + + if (!PQsendQuery(conn, sql)) + pgfdw_report_error(ERROR, NULL, conn, false, sql); + + fsstate->s.connpriv->current_owner = node; +} + +/* + * Fetch some more rows from the node's cursor. + */ +static void +fetch_received_data(ForeignScanState *node) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); PGresult *volatile res = NULL; MemoryContext oldcontext; + /* I should be the current connection owner */ + Assert(fsstate->s.connpriv->current_owner == node); + /* * We'll store the tuples in the batch_cxt. First, flush the previous - * batch. + * batch if no tuple is remaining */ - fsstate->tuples = NULL; - MemoryContextReset(fsstate->batch_cxt); + if (fsstate->next_tuple >= fsstate->num_tuples) + { + fsstate->tuples = NULL; + fsstate->num_tuples = 0; + MemoryContextReset(fsstate->batch_cxt); + } + else if (fsstate->next_tuple > 0) + { + /* move the remaining tuples to the beginning of the store */ + int n = 0; + + while(fsstate->next_tuple < fsstate->num_tuples) + fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++]; + fsstate->num_tuples = n; + } + oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt); /* PGresult must be released before leaving this function. */ PG_TRY(); { - PGconn *conn = fsstate->conn; + PGconn *conn = fsstate->s.conn; char sql[64]; - int numrows; + int addrows; + size_t newsize; int i; snprintf(sql, sizeof(sql), "FETCH %d FROM c%u", fsstate->fetch_size, fsstate->cursor_number); - res = pgfdw_exec_query(conn, sql); + res = pgfdw_get_result(conn, sql); /* On error, report the original query, not the FETCH. */ if (PQresultStatus(res) != PGRES_TUPLES_OK) pgfdw_report_error(ERROR, res, conn, false, fsstate->query); /* Convert the data into HeapTuples */ - numrows = PQntuples(res); - fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple)); - fsstate->num_tuples = numrows; - fsstate->next_tuple = 0; + addrows = PQntuples(res); + newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple); + if (fsstate->tuples) + fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize); + else + fsstate->tuples = (HeapTuple *) palloc(newsize); - for (i = 0; i < numrows; i++) + for (i = 0; i < addrows; i++) { Assert(IsA(node->ss.ps.plan, ForeignScan)); - fsstate->tuples[i] = + fsstate->tuples[fsstate->num_tuples + i] = make_tuple_from_result_row(res, i, fsstate->rel, fsstate->attinmeta, @@ -3125,27 +3374,82 @@ fetch_more_data(ForeignScanState *node) } /* Update fetch_ct_2 */ - if (fsstate->fetch_ct_2 < 2) + if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0) fsstate->fetch_ct_2++; + fsstate->next_tuple = 0; + fsstate->num_tuples += addrows; + /* Must be EOF if we didn't get as many tuples as we asked for. 
*/ - fsstate->eof_reached = (numrows < fsstate->fetch_size); + fsstate->eof_reached = (addrows < fsstate->fetch_size); PQclear(res); res = NULL; } PG_CATCH(); { + fsstate->s.connpriv->current_owner = NULL; if (res) PQclear(res); PG_RE_THROW(); } PG_END_TRY(); + fsstate->s.connpriv->current_owner = NULL; + MemoryContextSwitchTo(oldcontext); } /* + * Vacate a connection so that this node can send the next query + */ +static void +vacate_connection(PgFdwState *fdwstate) +{ + PgFdwConnpriv *connpriv = fdwstate->connpriv; + ForeignScanState *owner; + + if (connpriv == NULL || connpriv->current_owner == NULL) + return; + + /* + * let the current connection owner read the result for the running query + */ + owner = connpriv->current_owner; + fetch_received_data(owner); + + /* Clear the waiting list */ + while (owner) + { + PgFdwScanState *fsstate = GetPgFdwScanState(owner); + + fsstate->last_waiter = NULL; + owner = fsstate->waiter; + fsstate->waiter = NULL; + } +} + +/* + * Absorb the result of the current query. + */ +static void +absorb_current_result(ForeignScanState *node) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); + ForeignScanState *owner = fsstate->s.connpriv->current_owner; + + if (owner) + { + PgFdwScanState *target_state = GetPgFdwScanState(owner); + PGconn *conn = target_state->s.conn; + + while(PQisBusy(conn)) + PQclear(PQgetResult(conn)); + fsstate->s.connpriv->current_owner = NULL; + fsstate->async_waiting = false; + } +} +/* * Force assorted GUC parameters to settings that ensure that we'll output * data values in a form that is unambiguous to the remote server. * @@ -3229,7 +3533,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) /* Construct name we'll use for the prepared statement. */ snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u", - GetPrepStmtNumber(fmstate->conn)); + GetPrepStmtNumber(fmstate->s.conn)); p_name = pstrdup(prep_name); /* @@ -3239,12 +3543,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) * the prepared statements we use in this module are simple enough that * the remote server will make the right choices. */ - if (!PQsendPrepare(fmstate->conn, + if (!PQsendPrepare(fmstate->s.conn, p_name, fmstate->query, 0, NULL)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -3252,9 +3556,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); PQclear(res); /* This action shows that the prepare has been done. */ @@ -3515,9 +3819,9 @@ execute_dml_stmt(ForeignScanState *node) * the desired result. This allows us to avoid assuming that the remote * server has the same OIDs we do for the parameters' types. */ - if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams, + if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams, NULL, values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query); + pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query); /* * Get the result, and check for success. 
@@ -3525,10 +3829,10 @@ execute_dml_stmt(ForeignScanState *node) * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query); + dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query); if (PQresultStatus(dmstate->result) != (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true, + pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true, dmstate->query); /* Get the number of rows affected. */ @@ -5007,6 +5311,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root, /* XXX Consider parameterized paths for the join relation */ } +static bool +postgresIsForeignPathAsyncCapable(ForeignPath *path) +{ + return true; +} + + +/* + * Configure waiting event. + * + * Add an wait event only when the node is the connection owner. Elsewise + * another node on this connection is the owner. + */ +static bool +postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes, + void *caller_data, bool reinit) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); + + + /* If the caller didn't reinit, this event is already in event set */ + if (!reinit) + return true; + + if (fsstate->s.connpriv->current_owner == node) + { + AddWaitEventToSet(wes, + WL_SOCKET_READABLE, PQsocket(fsstate->s.conn), + NULL, caller_data); + return true; + } + + return false; +} + + /* * Assess whether the aggregation, grouping and having operations can be pushed * down to the foreign server. As a side effect, save information we obtain in diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h index d37cc88..132367a 100644 --- a/contrib/postgres_fdw/postgres_fdw.h +++ b/contrib/postgres_fdw/postgres_fdw.h @@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo UserMapping *user; /* only set in use_remote_estimate mode */ int fetch_size; /* fetch size for this remote table */ + bool allow_prefetch; /* true to allow overlapped fetching */ /* * Name of the relation while EXPLAINing ForeignScan. 
It is used for join @@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel); /* in connection.c */ extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt); +void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize); extern void ReleaseConnection(PGconn *conn); extern unsigned int GetCursorNumber(PGconn *conn); extern unsigned int GetPrepStmtNumber(PGconn *conn); diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql index 2863549..9ba8135 100644 --- a/contrib/postgres_fdw/sql/postgres_fdw.sql +++ b/contrib/postgres_fdw/sql/postgres_fdw.sql @@ -1614,25 +1614,25 @@ INSERT INTO b(aa) VALUES('bbb'); INSERT INTO b(aa) VALUES('bbbb'); INSERT INTO b(aa) VALUES('bbbbb'); -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; UPDATE b SET aa = 'new'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; UPDATE a SET aa = 'newtoo'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; @@ -1668,12 +1668,12 @@ insert into bar2 values(4,44,44); insert into bar2 values(7,77,77); explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for update; -select * from bar where f1 in (select f1 from foo) for update; +select * from bar where f1 in (select f1 from foo) order by 1 for update; +select * from bar where f1 in (select f1 from foo) order by 1 for update; explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for share; -select * from bar where f1 in (select f1 from foo) for share; +select * from bar where f1 in (select f1 from foo) order by 1 for share; +select * from bar where f1 in (select f1 from foo) order by 1 for share; -- Check UPDATE with inherited target and an inherited source table explain (verbose, costs off) @@ -1732,8 +1732,8 @@ explain (verbose, costs off) delete from foo where f1 < 5 returning *; delete from foo where f1 < 5 returning *; explain (verbose, costs off) -update bar set f2 = f2 + 100 returning *; -update bar set f2 = f2 + 100 returning *; +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; -- Test that UPDATE/DELETE with inherited target works with row-level triggers CREATE TRIGGER trig_row_before -- 2.9.2
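In short, the postgres_fdw side above multiplexes one connection among
several ForeignScan nodes: the node whose FETCH is in flight is recorded
in PgFdwConnpriv.current_owner, later arrivals queue themselves through
the waiter/last_waiter links, and vacate_connection() drains the
in-flight result before any other node may use the connection. The
underlying libpq pattern, reduced to a minimal standalone sketch
(illustration only, not patch code; error handling, the waiters' list
and the real cursor names/fetch sizes are omitted):

    #include <libpq-fe.h>

    static void
    async_fetch_sketch(PGconn *conn)
    {
        PGresult *res;

        /* request_more_data(): send the FETCH, return without blocking */
        if (!PQsendQuery(conn, "FETCH 100 FROM c1"))
            return;             /* a real caller reports the error */

        /*
         * fetch_received_data(): once the socket is readable, absorb
         * input until the whole result has arrived.
         */
        while (PQisBusy(conn))
        {
            if (!PQconsumeInput(conn))
                return;         /* connection trouble */
        }

        while ((res = PQgetResult(conn)) != NULL)
        {
            /* convert PQntuples(res) rows into HeapTuples here */
            PQclear(res);
        }
    }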
Hello. This is the new version of $Subject. But this is not just a
rebased version: on the way to fixing serious conflicts I refactored the
patch, and I believe it is now far more readable than its previous
shape.

# 0003 lacks the changes to postgres_fdw.out for now.

- Waiting-queue manipulation is moved into new functions. It had a bug
  where the same node could be inserted into the queue more than once;
  that is now fixed.

- postgresIterateForeignScan had a somewhat tricky structure in order
  to merge similar procedures, so it could hardly be called easy to
  read. Now it is far simpler and more straightforward.

- This still works only on Append/ForeignScan.

> > The attached PoC patch theoretically has no impact on the normal
> > code paths and just brings gain in async cases.

I performed almost the same test as before, but with:

- partition tables (there should be no difference from inheritance);

- tests for fetch_size of 200 and 1000 in addition to 100. A fetch_size
  of 100 unreasonably magnifies the lag from context switching on a
  single poor box in tests D/F below. (They became about twice as fast
  after adding a small delay (1000 iterations of clock_gettime() (*1))
  just before epoll_wait so that it doesn't sleep, I suppose...);

- the table size of test B is one tenth of the previous size, the same
  as one partition.

*1: The reason for that particular function is that I first noticed the
    queries get much faster just by prefixing them with "explain
    analyze"..

>                            patched(ms) unpatched(ms) gain(%)
> A: simple table scan     : 3562.32     3444.81       -3.4
> B: local partitioning    : 1451.25     1604.38        9.5
> C: single remote table   : 8818.92     9297.76        5.1
> D: sharding (single con) : 5966.14     6646.73       10.2
> E: sharding (multi con)  : 1802.25     6515.49       72.3

fetch_size = 100
                           patched(ms) unpatched(ms) gain(%)
A: simple table scan     : 3033.48     2997.44       -1.2
B: local partitioning    : 1405.52     1426.66        1.5
C: single remote table   : 8335.50     8463.22        1.5
D: sharding (single con) : 6862.92     6820.97       -0.6
E: sharding (multi con)  : 2185.84     6733.63       67.5
F: partition (single con): 6818.13     6741.01       -1.1
G: partition (multi con) : 2150.58     6407.46       66.4

fetch_size = 200
                           patched(ms) unpatched(ms) gain(%)
A: simple table scan     :
B: local partitioning    :
C: single remote table   :
D: sharding (single con) :
E: sharding (multi con)  :
F: partition (single con):
G: partition (multi con) :

fetch_size = 1000
                           patched(ms) unpatched(ms) gain(%)
A: simple table scan     : 3050.31     2980.29       -2.3
B: local partitioning    : 1401.34     1419.54        1.3
C: single remote table   : 8375.4      8445.27        0.8
D: sharding (single con) : 3935.97     4737.84       16.9
E: sharding (multi con)  : 1330.44     4752.87       72.0
F: partition (single con): 3997.63     4747.44       15.8
G: partition (multi con) : 1323.02     4807.72       72.5

Async append does not affect the non-async path at all, so B is
expected to show no degradation; the difference seems to be within
error. D and F show the gain when all foreign tables share one
connection, and E and G show the gain when every foreign table has a
dedicated connection.

I will repost after filling in the blank portions of the tables and
completing regression of the patch next week. Sorry for the incomplete
post.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 7ad4210dd20b6672367255492e2b1d95cd90b122 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner field to WaitEventSet and
allows the creator of a WaitEventSet to specify a resource owner.
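For illustration, the intended call pattern after this change; a minimal
sketch where nevents, sock and timeout stand for caller-supplied values,
and WAIT_EVENT_ASYNC_WAIT is added later in this series:

    /*
     * The set is remembered by the given resource owner, so it is freed
     * automatically if an error escapes before FreeWaitEventSet() runs.
     * Passing NULL keeps the old, manually managed behavior.
     */
    WaitEventSet *wes;
    WaitEvent     occurred;

    wes = CreateWaitEventSet(TopTransactionContext,
                             TopTransactionResourceOwner, nevents);
    AddWaitEventToSet(wes, WL_SOCKET_READABLE, sock, NULL, NULL);
    (void) WaitEventSetWait(wes, timeout, &occurred, 1,
                            WAIT_EVENT_ASYNC_WAIT);
    FreeWaitEventSet(wes);      /* also forgets the resowner entry */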
--- src/backend/libpq/pqcomm.c | 2 +- src/backend/storage/ipc/latch.c | 18 ++++++- src/backend/storage/lmgr/condition_variable.c | 2 +- src/backend/utils/resowner/resowner.c | 67 +++++++++++++++++++++++++++ src/include/storage/latch.h | 4 +- src/include/utils/resowner_private.h | 8 ++++ 6 files changed, 96 insertions(+), 5 deletions(-) diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index a4f6d4deeb..890972b9b8 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -220,7 +220,7 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3); + FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3); AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL); diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index e6706f7fb8..5457899f2d 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -51,6 +51,7 @@ #include "storage/latch.h" #include "storage/pmsignal.h" #include "storage/shmem.h" +#include "utils/resowner_private.h" /* * Select the fd readiness primitive to use. Normally the "most modern" @@ -77,6 +78,8 @@ struct WaitEventSet int nevents; /* number of registered events */ int nevents_space; /* maximum number of events in this set */ + ResourceOwner resowner; /* Resource owner */ + /* * Array, of nevents_space length, storing the definition of events this * set is waiting for. @@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, int ret = 0; int rc; WaitEvent event; - WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3); + WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3); if (wakeEvents & WL_TIMEOUT) Assert(timeout >= 0); @@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch) * WaitEventSetWait(). */ WaitEventSet * -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) { WaitEventSet *set; char *data; Size sz = 0; + if (res) + ResourceOwnerEnlargeWESs(res); + /* * Use MAXALIGN size/alignment to guarantee that later uses of memory are * aligned correctly. E.g. 
epoll_event might need 8 byte alignment on some @@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents) StaticAssertStmt(WSA_INVALID_EVENT == NULL, ""); #endif + /* Register this wait event set if requested */ + set->resowner = res; + if (res) + ResourceOwnerRememberWES(set->resowner, set); + return set; } @@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set) } #endif + if (set->resowner != NULL) + ResourceOwnerForgetWES(set->resowner, set); + pfree(set); } diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c index ef1d5baf01..30edc8e83a 100644 --- a/src/backend/storage/lmgr/condition_variable.c +++ b/src/backend/storage/lmgr/condition_variable.c @@ -69,7 +69,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv) { WaitEventSet *new_event_set; - new_event_set = CreateWaitEventSet(TopMemoryContext, 2); + new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2); AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL); AddWaitEventToSet(new_event_set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c index bce021e100..802b79a660 100644 --- a/src/backend/utils/resowner/resowner.c +++ b/src/backend/utils/resowner/resowner.c @@ -126,6 +126,7 @@ typedef struct ResourceOwnerData ResourceArray filearr; /* open temporary files */ ResourceArray dsmarr; /* dynamic shmem segments */ ResourceArray jitarr; /* JIT contexts */ + ResourceArray wesarr; /* wait event sets */ /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */ int nlocks; /* number of owned locks */ @@ -171,6 +172,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc); static void PrintSnapshotLeakWarning(Snapshot snapshot); static void PrintFileLeakWarning(File file); static void PrintDSMLeakWarning(dsm_segment *seg); +static void PrintWESLeakWarning(WaitEventSet *events); /***************************************************************************** @@ -440,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name) ResourceArrayInit(&(owner->filearr), FileGetDatum(-1)); ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL)); ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL)); + ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL)); return owner; } @@ -549,6 +552,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner, jit_release_context(context); } + + /* Ditto for wait event sets */ + while (ResourceArrayGetAny(&(owner->wesarr), &foundres)) + { + WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres); + + if (isCommit) + PrintWESLeakWarning(event); + FreeWaitEventSet(event); + } } else if (phase == RESOURCE_RELEASE_LOCKS) { @@ -697,6 +710,7 @@ ResourceOwnerDelete(ResourceOwner owner) Assert(owner->filearr.nitems == 0); Assert(owner->dsmarr.nitems == 0); Assert(owner->jitarr.nitems == 0); + Assert(owner->wesarr.nitems == 0); Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1); /* @@ -724,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner) ResourceArrayFree(&(owner->filearr)); ResourceArrayFree(&(owner->dsmarr)); ResourceArrayFree(&(owner->jitarr)); + ResourceArrayFree(&(owner->wesarr)); pfree(owner); } @@ -1301,3 +1316,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle) elog(ERROR, "JIT context %p is not owned by resource owner %s", DatumGetPointer(handle), owner->name); } + +/* + * wait event set reference array. 
+ * + * This is separate from actually inserting an entry because if we run out + * of memory, it's critical to do so *before* acquiring the resource. + */ +void +ResourceOwnerEnlargeWESs(ResourceOwner owner) +{ + ResourceArrayEnlarge(&(owner->wesarr)); +} + +/* + * Remember that a wait event set is owned by a ResourceOwner + * + * Caller must have previously done ResourceOwnerEnlargeWESs() + */ +void +ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events) +{ + ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events)); +} + +/* + * Forget that a wait event set is owned by a ResourceOwner + */ +void +ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events) +{ + /* + * XXXX: There's no property to show as an identier of a wait event set, + * use its pointer instead. + */ + if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events))) + elog(ERROR, "wait event set %p is not owned by resource owner %s", + events, owner->name); +} + +/* + * Debugging subroutine + */ +static void +PrintWESLeakWarning(WaitEventSet *events) +{ + /* + * XXXX: There's no property to show as an identier of a wait event set, + * use its pointer instead. + */ + elog(WARNING, "wait event set leak: %p still referenced", + events); +} diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h index a4bcb48874..838845af01 100644 --- a/src/include/storage/latch.h +++ b/src/include/storage/latch.h @@ -101,6 +101,7 @@ #define LATCH_H #include <signal.h> +#include "utils/resowner.h" /* * Latch structure should be treated as opaque and only accessed through @@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch); extern void SetLatch(volatile Latch *latch); extern void ResetLatch(volatile Latch *latch); -extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents); +extern WaitEventSet *CreateWaitEventSet(MemoryContext context, + ResourceOwner res, int nevents); extern void FreeWaitEventSet(WaitEventSet *set); extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch, void *user_data); diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h index a6e8eb71ab..3c06e4c3f8 100644 --- a/src/include/utils/resowner_private.h +++ b/src/include/utils/resowner_private.h @@ -18,6 +18,7 @@ #include "storage/dsm.h" #include "storage/fd.h" +#include "storage/latch.h" #include "storage/lock.h" #include "utils/catcache.h" #include "utils/plancache.h" @@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner, extern void ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle); +/* support for wait event set management */ +extern void ResourceOwnerEnlargeWESs(ResourceOwner owner); +extern void ResourceOwnerRememberWES(ResourceOwner owner, + WaitEventSet *); +extern void ResourceOwnerForgetWES(ResourceOwner owner, + WaitEventSet *); + #endif /* RESOWNER_PRIVATE_H */ -- 2.16.3 From 0b3b692e677f7fd19f618582412acf9d12231bb2 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 19 Oct 2017 17:23:51 +0900 Subject: [PATCH 2/3] core side modification --- contrib/postgres_fdw/expected/postgres_fdw.out | 100 +++++----- src/backend/commands/explain.c | 17 ++ src/backend/executor/Makefile | 2 +- src/backend/executor/execAsync.c | 145 ++++++++++++++ src/backend/executor/nodeAppend.c | 262 +++++++++++++++++++++---- src/backend/executor/nodeForeignscan.c | 22 ++- src/backend/nodes/copyfuncs.c | 2 + src/backend/nodes/outfuncs.c | 2 + src/backend/nodes/readfuncs.c | 2 + 
src/backend/optimizer/plan/createplan.c | 68 ++++++- src/backend/postmaster/pgstat.c | 3 + src/backend/utils/adt/ruleutils.c | 8 +- src/include/executor/execAsync.h | 23 +++ src/include/executor/executor.h | 1 + src/include/executor/nodeForeignscan.h | 3 + src/include/foreign/fdwapi.h | 11 ++ src/include/nodes/execnodes.h | 18 +- src/include/nodes/plannodes.h | 7 + src/include/pgstat.h | 3 +- 19 files changed, 603 insertions(+), 96 deletions(-) create mode 100644 src/backend/executor/execAsync.c create mode 100644 src/include/executor/execAsync.h diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index bb6b1a8fdf..248aa73c0b 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -6968,12 +6968,13 @@ select * from bar where f1 in (select f1 from foo) for update; Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 - -> Foreign Scan on public.foo2 + Async subplans: 1 + -> Async Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -(23 rows) + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 +(29 rows) select * from bar where f1 in (select f1 from foo) for update; f1 | f2 @@ -7006,12 +7007,13 @@ select * from bar where f1 in (select f1 from foo) for share; Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 - -> Foreign Scan on public.foo2 + Async subplans: 1 + -> Async Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -(23 rows) + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 +(29 rows) select * from bar where f1 in (select f1 from foo) for share; f1 | f2 @@ -7043,9 +7045,8 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 - -> Foreign Scan on public.foo2 + Async subplans: 1 + -> Async Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -> Hash Join @@ -7061,12 +7062,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 - -> Foreign Scan on public.foo2 + Async subplans: 1 + -> Async Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -(39 rows) + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 +(41 rows) update bar set f2 = f2 + 100 where f1 in (select f1 from foo); select tableoid::regclass, * from bar order by 1,2; @@ -7096,14 +7098,11 @@ where bar.f1 = ss.f1; Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1)) Hash Cond: (foo.f1 = bar.f1) -> Append - -> Seq Scan on public.foo - Output: ROW(foo.f1), foo.f1 - -> Foreign Scan on public.foo2 + Async subplans: 2 + -> Async Foreign Scan on public.foo2 Output: ROW(foo2.f1), foo2.f1 Remote SQL: SELECT f1 FROM public.loct1 - -> Seq Scan on public.foo foo_1 - Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) - -> Foreign Scan 
on public.foo2 foo2_1 + -> Async Foreign Scan on public.foo2 foo2_1 Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) Remote SQL: SELECT f1 FROM public.loct1 -> Hash @@ -7123,17 +7122,18 @@ where bar.f1 = ss.f1; Output: (ROW(foo.f1)), foo.f1 Sort Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: ROW(foo.f1), foo.f1 - -> Foreign Scan on public.foo2 + Async subplans: 2 + -> Async Foreign Scan on public.foo2 Output: ROW(foo2.f1), foo2.f1 Remote SQL: SELECT f1 FROM public.loct1 - -> Seq Scan on public.foo foo_1 - Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) - -> Foreign Scan on public.foo2 foo2_1 + -> Async Foreign Scan on public.foo2 foo2_1 Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) Remote SQL: SELECT f1 FROM public.loct1 -(45 rows) + -> Seq Scan on public.foo + Output: ROW(foo.f1), foo.f1 + -> Seq Scan on public.foo foo_1 + Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) +(47 rows) update bar set f2 = f2 + 100 from @@ -8155,11 +8155,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J Sort Sort Key: t1.a, t3.c -> Append - -> Foreign Scan + Async subplans: 2 + -> Async Foreign Scan Relations: ((public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)) INNER JOIN (public.ftprt1_p1 t3) - -> Foreign Scan + -> Async Foreign Scan Relations: ((public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)) INNER JOIN (public.ftprt1_p2 t3) -(7 rows) +(8 rows) SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE t1.a% 25 =0 ORDER BY 1,2,3; a | b | c @@ -8178,9 +8179,10 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) Sort Sort Key: t1.a, ftprt2_p1.b, ftprt2_p1.c -> Append - -> Foreign Scan + Async subplans: 1 + -> Async Foreign Scan Relations: (public.ftprt1_p1 t1) LEFT JOIN (public.ftprt2_p1 fprt2) -(5 rows) +(6 rows) SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) t2 ON (t1.a = t2.b and t1.b = t2.a) WHEREt1.a < 10 ORDER BY 1,2,3; a | b | c @@ -8200,11 +8202,12 @@ SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE Sort Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2) -> Append - -> Foreign Scan + Async subplans: 2 + -> Async Foreign Scan Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2) - -> Foreign Scan + -> Async Foreign Scan Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2) -(7 rows) +(8 rows) SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a % 25 =0 ORDER BY 1,2; t1 | t2 @@ -8223,11 +8226,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t Sort Sort Key: t1.a, t1.b -> Append - -> Foreign Scan + Async subplans: 2 + -> Async Foreign Scan Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2) - -> Foreign Scan + -> Async Foreign Scan Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2) -(7 rows) +(8 rows) SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25= 0 ORDER BY 1,2; a | b @@ -8309,10 +8313,11 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O Group Key: fpagg_tab_p1.a Filter: (avg(fpagg_tab_p1.b) < '22'::numeric) -> Append - -> Foreign Scan on fpagg_tab_p1 - -> Foreign Scan on fpagg_tab_p2 - -> Foreign Scan on fpagg_tab_p3 -(9 rows) + Async subplans: 3 + -> Async Foreign Scan on fpagg_tab_p1 + -> Async Foreign Scan on fpagg_tab_p2 + -> Async Foreign Scan on fpagg_tab_p3 +(10 rows) -- Plan 
with partitionwise aggregates is enabled SET enable_partitionwise_aggregate TO true; @@ -8323,13 +8328,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O Sort Sort Key: fpagg_tab_p1.a -> Append - -> Foreign Scan + Async subplans: 3 + -> Async Foreign Scan Relations: Aggregate on (public.fpagg_tab_p1 pagg_tab) - -> Foreign Scan + -> Async Foreign Scan Relations: Aggregate on (public.fpagg_tab_p2 pagg_tab) - -> Foreign Scan + -> Async Foreign Scan Relations: Aggregate on (public.fpagg_tab_p3 pagg_tab) -(9 rows) +(10 rows) SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1; a | sum | min | count diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 73d94b7235..09c5327cb4 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -83,6 +83,7 @@ static void show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es); static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors, ExplainState *es); +static void show_append_info(AppendState *astate, ExplainState *es); static void show_agg_keys(AggState *astate, List *ancestors, ExplainState *es); static void show_grouping_sets(PlanState *planstate, Agg *agg, @@ -1168,6 +1169,8 @@ ExplainNode(PlanState *planstate, List *ancestors, } if (plan->parallel_aware) appendStringInfoString(es->str, "Parallel "); + if (plan->async_capable) + appendStringInfoString(es->str, "Async "); appendStringInfoString(es->str, pname); es->indent++; } @@ -1690,6 +1693,11 @@ ExplainNode(PlanState *planstate, List *ancestors, case T_Hash: show_hash_info(castNode(HashState, planstate), es); break; + + case T_Append: + show_append_info(castNode(AppendState, planstate), es); + break; + default: break; } @@ -2027,6 +2035,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors, ancestors, es); } +static void +show_append_info(AppendState *astate, ExplainState *es) +{ + Append *plan = (Append *) astate->ps.plan; + + if (plan->nasyncplans > 0) + ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es); +} + /* * Show the grouping keys for an Agg node. */ diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile index cc09895fa5..8ad2adfe1c 100644 --- a/src/backend/executor/Makefile +++ b/src/backend/executor/Makefile @@ -12,7 +12,7 @@ subdir = src/backend/executor top_builddir = ../../.. include $(top_builddir)/src/Makefile.global -OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \ +OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \ execGrouping.o execIndexing.o execJunk.o \ execMain.o execParallel.o execPartition.o execProcnode.o \ execReplication.o execScan.o execSRF.o execTuples.o \ diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c new file mode 100644 index 0000000000..db477e2cf6 --- /dev/null +++ b/src/backend/executor/execAsync.c @@ -0,0 +1,145 @@ +/*------------------------------------------------------------------------- + * + * execAsync.c + * Support routines for asynchronous execution. 
+ * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/executor/execAsync.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "executor/execAsync.h" +#include "executor/nodeAppend.h" +#include "executor/nodeForeignscan.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "utils/memutils.h" +#include "utils/resowner.h" + +void ExecAsyncSetState(PlanState *pstate, AsyncState status) +{ + pstate->asyncstate = status; +} + +bool +ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node, + void *data, bool reinit) +{ + switch (nodeTag(node)) + { + case T_ForeignScanState: + return ExecForeignAsyncConfigureWait((ForeignScanState *)node, + wes, data, reinit); + break; + default: + elog(ERROR, "unrecognized node type: %d", + (int) nodeTag(node)); + } +} + +/* + * struct for memory context callback argument used in ExecAsyncEventWait + */ +typedef struct { + int **p_refind; + int *p_refindsize; +} ExecAsync_mcbarg; + +/* + * callback function to reset static variables pointing to the memory in + * TopTransactionContext in ExecAsyncEventWait. + */ +static void ExecAsyncMemoryContextCallback(void *arg) +{ + /* arg is the address of the variable refind in ExecAsyncEventWait */ + ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg; + *mcbarg->p_refind = NULL; + *mcbarg->p_refindsize = 0; +} + +#define EVENT_BUFFER_SIZE 16 + +Bitmapset * +ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout) +{ + static int *refind = NULL; + static int refindsize = 0; + WaitEventSet *wes; + WaitEvent occurred_event[EVENT_BUFFER_SIZE]; + int noccurred = 0; + Bitmapset *fired_events = NULL; + int i; + int n; + + n = bms_num_members(waitnodes); + wes = CreateWaitEventSet(TopTransactionContext, + TopTransactionResourceOwner, n); + if (refindsize < n) + { + if (refindsize == 0) + refindsize = EVENT_BUFFER_SIZE; /* XXX */ + while (refindsize < n) + refindsize *= 2; + if (refind) + refind = (int *) repalloc(refind, refindsize * sizeof(int)); + else + { + static ExecAsync_mcbarg mcb_arg = + { &refind, &refindsize }; + static MemoryContextCallback mcb = + { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL }; + MemoryContext oldctxt = + MemoryContextSwitchTo(TopTransactionContext); + + /* + * refind points to a memory block in + * TopTransactionContext. Register a callback to reset it. 
+ */ + MemoryContextRegisterResetCallback(TopTransactionContext, &mcb); + refind = (int *) palloc(refindsize * sizeof(int)); + MemoryContextSwitchTo(oldctxt); + } + } + + n = 0; + for (i = bms_next_member(waitnodes, -1) ; i >= 0 ; + i = bms_next_member(waitnodes, i)) + { + refind[i] = i; + if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true)) + n++; + } + + if (n == 0) + { + FreeWaitEventSet(wes); + return NULL; + } + + noccurred = WaitEventSetWait(wes, timeout, occurred_event, + EVENT_BUFFER_SIZE, + WAIT_EVENT_ASYNC_WAIT); + FreeWaitEventSet(wes); + if (noccurred == 0) + return NULL; + + for (i = 0 ; i < noccurred ; i++) + { + WaitEvent *w = &occurred_event[i]; + + if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0) + { + int n = *(int*)w->user_data; + + fired_events = bms_add_member(fired_events, n); + } + } + + return fired_events; +} diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c index 6bc3e470bf..ed8612dd37 100644 --- a/src/backend/executor/nodeAppend.c +++ b/src/backend/executor/nodeAppend.c @@ -60,6 +60,7 @@ #include "executor/execdebug.h" #include "executor/execPartition.h" #include "executor/nodeAppend.h" +#include "executor/execAsync.h" #include "miscadmin.h" /* Shared state for parallel-aware Append. */ @@ -81,6 +82,7 @@ struct ParallelAppendState #define NO_MATCHING_SUBPLANS -2 static TupleTableSlot *ExecAppend(PlanState *pstate); +static TupleTableSlot *ExecAppendAsync(PlanState *pstate); static bool choose_next_subplan_locally(AppendState *node); static bool choose_next_subplan_for_leader(AppendState *node); static bool choose_next_subplan_for_worker(AppendState *node); @@ -104,13 +106,14 @@ ExecInitAppend(Append *node, EState *estate, int eflags) PlanState **appendplanstates; Bitmapset *validsubplans; int nplans; + int nasyncplans; int firstvalid; int i, j; ListCell *lc; /* check for unsupported flags */ - Assert(!(eflags & EXEC_FLAG_MARK)); + Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC))); /* * Lock the non-leaf tables in the partition tree controlled by this node. 
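A note for FDW authors reading along: ExecAsyncConfigureWait above only dispatches to the foreign-scan callback, and all that callback has to do is register the connection's socket in the caller-supplied WaitEventSet. A minimal sketch of such a callback, assuming a libpq-based FDW that tracks its PGconn and an in-flight flag in private scan state (MyFdwScanState and myfdw_configure_wait are made-up names, not part of this patch):

typedef struct MyFdwScanState
{
    PGconn     *conn;               /* connection shared with other scans */
    bool        query_in_flight;    /* have we sent a query to wait on? */
} MyFdwScanState;

static bool
myfdw_configure_wait(ForeignScanState *node, WaitEventSet *wes,
                     void *caller_data, bool reinit)
{
    MyFdwScanState *fsstate = (MyFdwScanState *) node->fdw_state;

    /* nothing to wait for unless a query is actually in flight */
    if (!fsstate->query_in_flight)
        return false;

    /* caller_data comes back through WaitEvent->user_data when we fire */
    AddWaitEventToSet(wes, WL_SOCKET_READABLE, PQsocket(fsstate->conn),
                      NULL, caller_data);
    return true;
}

Returning true is what makes ExecAsyncEventWait count the node into the wait set; if every callback returns false, the n == 0 path above skips WaitEventSetWait() entirely.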
@@ -123,10 +126,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags) */ appendstate->ps.plan = (Plan *) node; appendstate->ps.state = estate; - appendstate->ps.ExecProcNode = ExecAppend; + + /* choose appropriate version of Exec function */ + if (node->nasyncplans == 0) + appendstate->ps.ExecProcNode = ExecAppend; + else + appendstate->ps.ExecProcNode = ExecAppendAsync; /* Let choose_next_subplan_* function handle setting the first subplan */ - appendstate->as_whichplan = INVALID_SUBPLAN_INDEX; + appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX; /* If run-time partition pruning is enabled, then set that up now */ if (node->part_prune_infos != NIL) @@ -159,7 +167,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags) */ if (bms_is_empty(validsubplans)) { - appendstate->as_whichplan = NO_MATCHING_SUBPLANS; + appendstate->as_whichsyncplan = NO_MATCHING_SUBPLANS; /* Mark the first as valid so that it's initialized below */ validsubplans = bms_make_singleton(0); @@ -213,11 +221,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags) */ j = i = 0; firstvalid = nplans; + nasyncplans = 0; foreach(lc, node->appendplans) { if (bms_is_member(i, validsubplans)) { Plan *initNode = (Plan *) lfirst(lc); + int sub_eflags = eflags; + + /* Let async-capable subplans run asynchronously */ + if (i < node->nasyncplans) + { + sub_eflags |= EXEC_FLAG_ASYNC; + nasyncplans++; + } /* * Record the lowest appendplans index which is a valid partial @@ -226,7 +243,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags) if (i >= node->first_partial_plan && j < firstvalid) firstvalid = j; - appendplanstates[j++] = ExecInitNode(initNode, estate, eflags); + appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags); } i++; } @@ -235,6 +252,21 @@ ExecInitAppend(Append *node, EState *estate, int eflags) appendstate->appendplans = appendplanstates; appendstate->as_nplans = nplans; + /* fill in async stuff */ + appendstate->as_nasyncplans = nasyncplans; + appendstate->as_syncdone = (nasyncplans == nplans); + + if (appendstate->as_nasyncplans) + { + appendstate->as_asyncresult = (TupleTableSlot **) + palloc0(node->nasyncplans * sizeof(TupleTableSlot *)); + + /* initially, all async subplans need a request */ + for (i = 0; i < appendstate->as_nasyncplans; ++i) + appendstate->as_needrequest = + bms_add_member(appendstate->as_needrequest, i); + } + /* * Miscellaneous initialization */ @@ -258,21 +290,23 @@ ExecAppend(PlanState *pstate) { AppendState *node = castNode(AppendState, pstate); - if (node->as_whichplan < 0) + if (node->as_whichsyncplan < 0) { /* * If no subplan has been chosen, we must choose one before * proceeding.
*/ - if (node->as_whichplan == INVALID_SUBPLAN_INDEX && + if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX && !node->choose_next_subplan(node)) return ExecClearTuple(node->ps.ps_ResultTupleSlot); /* Nothing to do if there are no matching subplans */ - else if (node->as_whichplan == NO_MATCHING_SUBPLANS) + else if (node->as_whichsyncplan == NO_MATCHING_SUBPLANS) return ExecClearTuple(node->ps.ps_ResultTupleSlot); } + Assert(node->as_nasyncplans == 0); + for (;;) { PlanState *subnode; @@ -283,8 +317,9 @@ ExecAppend(PlanState *pstate) /* * figure out which subplan we are currently processing */ - Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans); - subnode = node->appendplans[node->as_whichplan]; + Assert(node->as_whichsyncplan >= 0 && + node->as_whichsyncplan < node->as_nplans); + subnode = node->appendplans[node->as_whichsyncplan]; /* * get a tuple from the subplan @@ -307,6 +342,156 @@ ExecAppend(PlanState *pstate) } } +static TupleTableSlot * +ExecAppendAsync(PlanState *pstate) +{ + AppendState *node = castNode(AppendState, pstate); + Bitmapset *needrequest; + int i; + + Assert(node->as_nasyncplans > 0); + + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + needrequest = node->as_needrequest; + node->as_needrequest = NULL; + while ((i = bms_first_member(needrequest)) >= 0) + { + TupleTableSlot *slot; + PlanState *subnode = node->appendplans[i]; + + slot = ExecProcNode(subnode); + if (subnode->asyncstate == AS_AVAILABLE) + { + if (!TupIsNull(slot)) + { + node->as_asyncresult[node->as_nasyncresult++] = slot; + node->as_needrequest = bms_add_member(node->as_needrequest, i); + } + } + else + node->as_pending_async = bms_add_member(node->as_pending_async, i); + } + bms_free(needrequest); + + for (;;) + { + TupleTableSlot *result; + + /* return now if a result is available */ + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + while (!bms_is_empty(node->as_pending_async)) + { + long timeout = node->as_syncdone ? -1 : 0; + Bitmapset *fired; + int i; + + fired = ExecAsyncEventWait(node->appendplans, + node->as_pending_async, + timeout); + Assert(!node->as_syncdone || !bms_is_empty(fired)); + + while ((i = bms_first_member(fired)) >= 0) + { + TupleTableSlot *slot; + PlanState *subnode = node->appendplans[i]; + slot = ExecProcNode(subnode); + if (subnode->asyncstate == AS_AVAILABLE) + { + if (!TupIsNull(slot)) + { + node->as_asyncresult[node->as_nasyncresult++] = slot; + node->as_needrequest = + bms_add_member(node->as_needrequest, i); + } + node->as_pending_async = + bms_del_member(node->as_pending_async, i); + } + } + bms_free(fired); + + /* return now if a result is available */ + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + if (!node->as_syncdone) + break; + } + + /* + * If there is no asynchronous activity still pending and the + * synchronous activity is also complete, we're totally done scanning + * this node. Otherwise, we're done with the asynchronous stuff but + * must continue scanning the synchronous children. + */ + if (node->as_syncdone) + { + Assert(bms_is_empty(node->as_pending_async)); + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + + /* + * get a tuple from the subplan + */ + + if (node->as_whichsyncplan < 0) + { + /* + * If no subplan has been chosen, we must choose one before + * proceeding. 
+ */ + if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX && + !node->choose_next_subplan(node)) + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + + /* Nothing to do if there are no matching subplans */ + else if (node->as_whichsyncplan == NO_MATCHING_SUBPLANS) + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + + result = ExecProcNode(node->appendplans[node->as_whichsyncplan]); + + if (!TupIsNull(result)) + { + /* + * If the subplan gave us something then return it as-is. We do + * NOT make use of the result slot that was set up in + * ExecInitAppend; there's no need for it. + */ + return result; + } + + /* + * Go on to the "next" subplan in the appropriate direction. If no + * more subplans, return the empty slot set up for us by + * ExecInitAppend, unless there are async plans we have yet to finish. + */ + if (!node->choose_next_subplan(node)) + { + node->as_syncdone = true; + if (bms_is_empty(node->as_pending_async)) + { + Assert(bms_is_empty(node->as_needrequest)); + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + } + + /* Else loop back and try to get a tuple from the new subplan */ + } +} + /* ---------------------------------------------------------------- * ExecEndAppend * @@ -353,6 +538,15 @@ ExecReScanAppend(AppendState *node) node->as_valid_subplans = NULL; } + /* Reset async state. */ + for (i = 0; i < node->as_nasyncplans; ++i) + { + ExecShutdownNode(node->appendplans[i]); + node->as_needrequest = bms_add_member(node->as_needrequest, i); + } + node->as_nasyncresult = 0; + node->as_syncdone = (node->as_nasyncplans == node->as_nplans); + for (i = 0; i < node->as_nplans; i++) { PlanState *subnode = node->appendplans[i]; @@ -373,7 +567,7 @@ ExecReScanAppend(AppendState *node) } /* Let choose_next_subplan_* function handle setting the first subplan */ - node->as_whichplan = INVALID_SUBPLAN_INDEX; + node->as_whichsyncplan = INVALID_SUBPLAN_INDEX; } /* ---------------------------------------------------------------- @@ -461,7 +655,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt) static bool choose_next_subplan_locally(AppendState *node) { - int whichplan = node->as_whichplan; + int whichplan = node->as_whichsyncplan; int nextplan; /* We should never be called when there are no subplans */ @@ -494,7 +688,7 @@ choose_next_subplan_locally(AppendState *node) if (nextplan < 0) return false; - node->as_whichplan = nextplan; + node->as_whichsyncplan = nextplan; return true; } @@ -516,19 +710,19 @@ choose_next_subplan_for_leader(AppendState *node) Assert(ScanDirectionIsForward(node->ps.state->es_direction)); /* We should never be called when there are no subplans */ - Assert(node->as_whichplan != NO_MATCHING_SUBPLANS); + Assert(node->as_whichsyncplan != NO_MATCHING_SUBPLANS); LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE); - if (node->as_whichplan != INVALID_SUBPLAN_INDEX) + if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX) { /* Mark just-completed subplan as finished. */ - node->as_pstate->pa_finished[node->as_whichplan] = true; + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; } else { /* Start with last subplan. */ - node->as_whichplan = node->as_nplans - 1; + node->as_whichsyncplan = node->as_nplans - 1; /* * If we've yet to determine the valid subplans for these parameters @@ -549,12 +743,12 @@ choose_next_subplan_for_leader(AppendState *node) } /* Loop until we find a subplan to execute. 
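To make the control flow of ExecAppendAsync above concrete, consider an Append over two async-capable foreign scans A and B plus one local seq scan S. The first call fires a request at both A and B (they start out in as_needrequest); whichever has no tuple ready yet moves into as_pending_async. While S still has tuples, the pending set is polled with a timeout of 0, so remote latency hides behind local work; once the synchronous subplans run dry, as_syncdone is set and the loop blocks in ExecAsyncEventWait() with a timeout of -1 until A or B can deliver. Every async tuple handed back also puts its subplan into as_needrequest again, so the next call issues a fresh request for it.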
*/ - while (pstate->pa_finished[node->as_whichplan]) + while (pstate->pa_finished[node->as_whichsyncplan]) { - if (node->as_whichplan == 0) + if (node->as_whichsyncplan == 0) { pstate->pa_next_plan = INVALID_SUBPLAN_INDEX; - node->as_whichplan = INVALID_SUBPLAN_INDEX; + node->as_whichsyncplan = INVALID_SUBPLAN_INDEX; LWLockRelease(&pstate->pa_lock); return false; } @@ -563,12 +757,12 @@ choose_next_subplan_for_leader(AppendState *node) * We needn't pay attention to as_valid_subplans here as all invalid * plans have been marked as finished. */ - node->as_whichplan--; + node->as_whichsyncplan--; } /* If non-partial, immediately mark as finished. */ - if (node->as_whichplan < node->as_first_partial_plan) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan < node->as_first_partial_plan) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; LWLockRelease(&pstate->pa_lock); @@ -597,13 +791,13 @@ choose_next_subplan_for_worker(AppendState *node) Assert(ScanDirectionIsForward(node->ps.state->es_direction)); /* We should never be called when there are no subplans */ - Assert(node->as_whichplan != NO_MATCHING_SUBPLANS); + Assert(node->as_whichsyncplan != NO_MATCHING_SUBPLANS); LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE); /* Mark just-completed subplan as finished. */ - if (node->as_whichplan != INVALID_SUBPLAN_INDEX) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; /* * If we've yet to determine the valid subplans for these parameters then @@ -625,7 +819,7 @@ choose_next_subplan_for_worker(AppendState *node) } /* Save the plan from which we are starting the search. */ - node->as_whichplan = pstate->pa_next_plan; + node->as_whichsyncplan = pstate->pa_next_plan; /* Loop until we find a valid subplan to execute. */ while (pstate->pa_finished[pstate->pa_next_plan]) @@ -639,7 +833,7 @@ choose_next_subplan_for_worker(AppendState *node) /* Advance to the next valid plan. */ pstate->pa_next_plan = nextplan; } - else if (node->as_whichplan > node->as_first_partial_plan) + else if (node->as_whichsyncplan > node->as_first_partial_plan) { /* * Try looping back to the first valid partial plan, if there is @@ -648,7 +842,7 @@ choose_next_subplan_for_worker(AppendState *node) nextplan = bms_next_member(node->as_valid_subplans, node->as_first_partial_plan - 1); pstate->pa_next_plan = - nextplan < 0 ? node->as_whichplan : nextplan; + nextplan < 0 ? node->as_whichsyncplan : nextplan; } else { @@ -656,10 +850,10 @@ choose_next_subplan_for_worker(AppendState *node) * At last plan, and either there are no partial plans or we've * tried them all. Arrange to bail out. */ - pstate->pa_next_plan = node->as_whichplan; + pstate->pa_next_plan = node->as_whichsyncplan; } - if (pstate->pa_next_plan == node->as_whichplan) + if (pstate->pa_next_plan == node->as_whichsyncplan) { /* We've tried everything! */ pstate->pa_next_plan = INVALID_SUBPLAN_INDEX; @@ -669,7 +863,7 @@ choose_next_subplan_for_worker(AppendState *node) } /* Pick the plan we found, and advance pa_next_plan one more time. */ - node->as_whichplan = pstate->pa_next_plan; + node->as_whichsyncplan = pstate->pa_next_plan; pstate->pa_next_plan = bms_next_member(node->as_valid_subplans, pstate->pa_next_plan); @@ -696,8 +890,8 @@ choose_next_subplan_for_worker(AppendState *node) } /* If non-partial, immediately mark as finished. 
*/ - if (node->as_whichplan < node->as_first_partial_plan) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan < node->as_first_partial_plan) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; LWLockRelease(&pstate->pa_lock); diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c index a2a28b7ec2..915deb7080 100644 --- a/src/backend/executor/nodeForeignscan.c +++ b/src/backend/executor/nodeForeignscan.c @@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate) (ExecScanRecheckMtd) ForeignRecheck); } - /* ---------------------------------------------------------------- * ExecInitForeignScan * ---------------------------------------------------------------- @@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags) scanstate->ss.ps.plan = (Plan *) node; scanstate->ss.ps.state = estate; scanstate->ss.ps.ExecProcNode = ExecForeignScan; + scanstate->ss.ps.asyncstate = AS_AVAILABLE; + + if ((eflags & EXEC_FLAG_ASYNC) != 0) + scanstate->fs_async = true; /* * Miscellaneous initialization @@ -387,3 +390,20 @@ ExecShutdownForeignScan(ForeignScanState *node) if (fdwroutine->ShutdownForeignScan) fdwroutine->ShutdownForeignScan(node); } + +/* ---------------------------------------------------------------- + * ExecAsyncForeignScanConfigureWait + * + * In async mode, configure for a wait + * ---------------------------------------------------------------- + */ +bool +ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes, + void *caller_data, bool reinit) +{ + FdwRoutine *fdwroutine = node->fdwroutine; + + Assert(fdwroutine->ForeignAsyncConfigureWait != NULL); + return fdwroutine->ForeignAsyncConfigureWait(node, wes, + caller_data, reinit); +} diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index 7c045a7afe..8304dd5b17 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -246,6 +246,8 @@ _copyAppend(const Append *from) COPY_NODE_FIELD(appendplans); COPY_SCALAR_FIELD(first_partial_plan); COPY_NODE_FIELD(part_prune_infos); + COPY_SCALAR_FIELD(nasyncplans); + COPY_SCALAR_FIELD(referent); return newnode; } diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index 1da9d7ed15..ed655f4ccb 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -403,6 +403,8 @@ _outAppend(StringInfo str, const Append *node) WRITE_NODE_FIELD(appendplans); WRITE_INT_FIELD(first_partial_plan); WRITE_NODE_FIELD(part_prune_infos); + WRITE_INT_FIELD(nasyncplans); + WRITE_INT_FIELD(referent); } static void diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c index 2826cec2f8..fb4ae251de 100644 --- a/src/backend/nodes/readfuncs.c +++ b/src/backend/nodes/readfuncs.c @@ -1652,6 +1652,8 @@ _readAppend(void) READ_NODE_FIELD(appendplans); READ_INT_FIELD(first_partial_plan); READ_NODE_FIELD(part_prune_infos); + READ_INT_FIELD(nasyncplans); + READ_INT_FIELD(referent); READ_DONE(); } diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index 0317763f43..eda3420d02 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -211,7 +211,9 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual, Index scanrelid, int wtParam); static Append *make_append(List *appendplans, int first_partial_plan, - List *tlist, List 
*partitioned_rels, List *partpruneinfos); + int nasyncplans, int referent, + List *tlist, + List *partitioned_rels, List *partpruneinfos); static RecursiveUnion *make_recursive_union(List *tlist, Plan *lefttree, Plan *righttree, @@ -294,6 +296,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root, List *rowMarks, OnConflictExpr *onconflict, int epqParam); static GatherMerge *create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path); +static bool is_async_capable_path(Path *path); /* @@ -1036,10 +1039,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path) { Append *plan; List *tlist = build_path_tlist(root, &best_path->path); - List *subplans = NIL; + List *asyncplans = NIL; + List *syncplans = NIL; ListCell *subpaths; RelOptInfo *rel = best_path->path.parent; List *partpruneinfos = NIL; + int nasyncplans = 0; + bool first = true; + bool referent_is_sync = true; /* * The subpaths list could be empty, if every child was proven empty by @@ -1074,7 +1081,22 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path) /* Must insist that all children return the same tlist */ subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST); - subplans = lappend(subplans, subplan); + /* + * Classify as async-capable or not. If we have decided to run the + * children in parallel, we cannot run any one of them asynchronously. + */ + if (!best_path->path.parallel_safe && is_async_capable_path(subpath)) + { + subplan->async_capable = true; + asyncplans = lappend(asyncplans, subplan); + ++nasyncplans; + if (first) + referent_is_sync = false; + } + else + syncplans = lappend(syncplans, subplan); + + first = false; } if (enable_partition_pruning && @@ -1117,9 +1139,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path) * parent-rel Vars it'll be asked to emit. */ - plan = make_append(subplans, best_path->first_partial_path, - tlist, best_path->partitioned_rels, - partpruneinfos); + plan = make_append(list_concat(asyncplans, syncplans), + best_path->first_partial_path, nasyncplans, + referent_is_sync ? nasyncplans : 0, tlist, + best_path->partitioned_rels, partpruneinfos); copy_generic_path_info(&plan->plan, (Path *) best_path); @@ -5414,9 +5437,9 @@ make_foreignscan(List *qptlist, } static Append * -make_append(List *appendplans, int first_partial_plan, - List *tlist, List *partitioned_rels, - List *partpruneinfos) +make_append(List *appendplans, int first_partial_plan, int nasyncplans, + int referent, List *tlist, + List *partitioned_rels, List *partpruneinfos) { Append *node = makeNode(Append); Plan *plan = &node->plan; @@ -5429,6 +5452,9 @@ make_append(List *appendplans, int first_partial_plan, node->appendplans = appendplans; node->first_partial_plan = first_partial_plan; node->part_prune_infos = partpruneinfos; + node->nasyncplans = nasyncplans; + node->referent = referent; + return node; } @@ -6773,3 +6799,27 @@ is_projection_capable_plan(Plan *plan) } return true; } + +/* + * is_async_capable_path + * Check whether a given Path node is async-capable.
+ */ +static bool +is_async_capable_path(Path *path) +{ + switch (nodeTag(path)) + { + case T_ForeignPath: + { + FdwRoutine *fdwroutine = path->parent->fdwroutine; + + Assert(fdwroutine != NULL); + if (fdwroutine->IsForeignPathAsyncCapable != NULL && + fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path)) + return true; + } + default: + break; + } + return false; +} diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 084573e77c..7aef97ca97 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -3683,6 +3683,9 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_SYNC_REP: event_name = "SyncRep"; break; + case WAIT_EVENT_ASYNC_WAIT: + event_name = "AsyncExecWait"; + break; /* no default case, so that compiler will warn */ } diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index 065238b0fe..fe202cbfea 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -4513,7 +4513,7 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps) dpns->planstate = ps; /* - * We special-case Append and MergeAppend to pretend that the first child + * We special-case Append and MergeAppend to pretend that a specific child * plan is the OUTER referent; we have to interpret OUTER Vars in their * tlists according to one of the children, and the first one is the most * natural choice. Likewise special-case ModifyTable to pretend that the * @@ -4521,7 +4521,11 @@ * lists containing references to non-target relations. */ if (IsA(ps, AppendState)) - dpns->outer_planstate = ((AppendState *) ps)->appendplans[0]; + { + AppendState *aps = (AppendState *) ps; + Append *app = (Append *) ps->plan; + dpns->outer_planstate = aps->appendplans[app->referent]; + } else if (IsA(ps, MergeAppendState)) dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0]; else if (IsA(ps, ModifyTableState)) diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h new file mode 100644 index 0000000000..5fd67d9004 --- /dev/null +++ b/src/include/executor/execAsync.h @@ -0,0 +1,23 @@ +/*-------------------------------------------------------------------- + * execAsync.h + * Support functions for asynchronous query execution + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/include/executor/execAsync.h +-------------------------------------------------------------------- + */ +#ifndef EXECASYNC_H +#define EXECASYNC_H + +#include "nodes/execnodes.h" +#include "storage/latch.h" + +extern void ExecAsyncSetState(PlanState *pstate, AsyncState status); +extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node, + void *data, bool reinit); +extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, + long timeout); +#endif /* EXECASYNC_H */ diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index a7ea3c7d10..8e9d87669f 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -63,6 +63,7 @@ #define EXEC_FLAG_WITH_OIDS 0x0020 /* force OIDs in returned tuples */ #define EXEC_FLAG_WITHOUT_OIDS 0x0040 /* force no OIDs in returned tuples */ #define EXEC_FLAG_WITH_NO_DATA 0x0080 /* rel scannability doesn't matter */ +#define EXEC_FLAG_ASYNC 0x0100 /* request async execution */ /* Hook for plugins to get control
in ExecutorStart() */ diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h index ccb66be733..67abf8e52e 100644 --- a/src/include/executor/nodeForeignscan.h +++ b/src/include/executor/nodeForeignscan.h @@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node, extern void ExecForeignScanInitializeWorker(ForeignScanState *node, ParallelWorkerContext *pwcxt); extern void ExecShutdownForeignScan(ForeignScanState *node); +extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, bool reinit); #endif /* NODEFOREIGNSCAN_H */ diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h index c14eb546c6..c00e9621fb 100644 --- a/src/include/foreign/fdwapi.h +++ b/src/include/foreign/fdwapi.h @@ -168,6 +168,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root, typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root, List *fdw_private, RelOptInfo *child_rel); +typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path); +typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, + bool reinit); /* * FdwRoutine is the struct returned by a foreign-data wrapper's handler @@ -189,6 +194,7 @@ typedef struct FdwRoutine GetForeignPlan_function GetForeignPlan; BeginForeignScan_function BeginForeignScan; IterateForeignScan_function IterateForeignScan; + IterateForeignScan_function IterateForeignScanAsync; ReScanForeignScan_function ReScanForeignScan; EndForeignScan_function EndForeignScan; @@ -241,6 +247,11 @@ typedef struct FdwRoutine InitializeDSMForeignScan_function InitializeDSMForeignScan; ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan; InitializeWorkerForeignScan_function InitializeWorkerForeignScan; + + /* Support functions for asynchronous execution */ + IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable; + ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait; + ShutdownForeignScan_function ShutdownForeignScan; /* Support functions for path reparameterization. */ diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index da7f52cab0..56bfe3f442 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -905,6 +905,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate); * abstract superclass for all PlanState-type nodes. * ---------------- */ +typedef enum AsyncState +{ + AS_AVAILABLE, + AS_WAITING +} AsyncState; + typedef struct PlanState { NodeTag type; @@ -953,6 +959,9 @@ typedef struct PlanState * descriptor, without encoding knowledge about all executor nodes. 
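Of the two new FdwRoutine callbacks, IsForeignPathAsyncCapable is the planner-facing half. Just to illustrate the expected shape, here is a hedged sketch of an implementation; the policy it encodes (async only for plain base-relation scans) is an assumption for the example, not what the postgres_fdw patch below does:

static bool
myfdw_is_async_capable(ForeignPath *path)
{
    /*
     * Illustrative policy only: claim async capability for base-relation
     * scans and punt on join and upper-relation paths.
     */
    return IS_SIMPLE_REL(path->path.parent);
}

create_append_plan() consults this through is_async_capable_path() when it divides an Append's children into the async and sync groups.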
*/ TupleDesc scandesc; + + AsyncState asyncstate; + int32 padding; /* to keep alignment of derived types */ } PlanState; /* ---------------- @@ -1087,14 +1096,20 @@ struct AppendState PlanState ps; /* its first field is NodeTag */ PlanState **appendplans; /* array of PlanStates for my inputs */ int as_nplans; - int as_whichplan; + int as_whichsyncplan; /* which sync plan is being executed */ int as_first_partial_plan; /* Index of 'appendplans' containing * the first partial plan */ + int as_nasyncplans; /* # of async-capable children */ ParallelAppendState *as_pstate; /* parallel coordination info */ Size pstate_len; /* size of parallel coordination info */ struct PartitionPruneState *as_prune_state; Bitmapset *as_valid_subplans; bool (*choose_next_subplan) (AppendState *); + bool as_syncdone; /* all synchronous plans done? */ + Bitmapset *as_needrequest; /* async plans needing a new request */ + Bitmapset *as_pending_async; /* pending async plans */ + TupleTableSlot **as_asyncresult; /* unreturned results of async plans */ + int as_nasyncresult; /* # of valid entries in as_asyncresult */ }; /* ---------------- @@ -1643,6 +1658,7 @@ typedef struct ForeignScanState Size pscan_len; /* size of parallel coordination information */ /* use struct pointer to avoid including fdwapi.h here */ struct FdwRoutine *fdwroutine; + bool fs_async; void *fdw_state; /* foreign-data wrapper can keep state here */ } ForeignScanState; diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h index f2dda82e66..8a64c037c9 100644 --- a/src/include/nodes/plannodes.h +++ b/src/include/nodes/plannodes.h @@ -139,6 +139,11 @@ typedef struct Plan bool parallel_aware; /* engage parallel-aware logic? */ bool parallel_safe; /* OK to use as part of parallel plan? */ + /* + * information needed for asynchronous execution + */ + bool async_capable; /* engage asynchronous execution logic? */ + /* * Common structural data for all Plan types.
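The asyncstate field carries the whole requestor/requestee protocol: an empty slot from an async child means either EOF or "nothing yet", and asyncstate disambiguates the two. Condensed from ExecAppendAsync above (this is a simplified restatement of existing code in this patch, not a new interface):

    slot = ExecProcNode(subnode);   /* doubles as the async "request" */
    if (subnode->asyncstate == AS_WAITING)
    {
        /* no tuple yet; wait on its socket via ExecAsyncEventWait() */
        node->as_pending_async = bms_add_member(node->as_pending_async, i);
    }
    else if (!TupIsNull(slot))
    {
        /* tuple ready now; queue it and remember to ask again */
        node->as_asyncresult[node->as_nasyncresult++] = slot;
        node->as_needrequest = bms_add_member(node->as_needrequest, i);
    }
    /* else: AS_AVAILABLE with an empty slot means the child reached EOF */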
*/ @@ -262,6 +267,8 @@ typedef struct Append * Mapping details for run-time subplan pruning, one per partitioned_rels */ List *part_prune_infos; + int nasyncplans; /* # of async plans, always at start of list */ + int referent; /* index of inheritance tree referent */ } Append; /* ---------------- diff --git a/src/include/pgstat.h b/src/include/pgstat.h index be2f59239b..6f4583b46c 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -832,7 +832,8 @@ typedef enum WAIT_EVENT_REPLICATION_ORIGIN_DROP, WAIT_EVENT_REPLICATION_SLOT_DROP, WAIT_EVENT_SAFE_SNAPSHOT, - WAIT_EVENT_SYNC_REP + WAIT_EVENT_SYNC_REP, + WAIT_EVENT_ASYNC_WAIT } WaitEventIPC; /* ---------- -- 2.16.3 From 072f6af8a2b394402e753a65569d64668e2cfe86 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 19 Oct 2017 17:24:07 +0900 Subject: [PATCH 3/3] async postgres_fdw --- contrib/postgres_fdw/connection.c | 26 ++ contrib/postgres_fdw/expected/postgres_fdw.out | 100 ++-- contrib/postgres_fdw/postgres_fdw.c | 619 ++++++++++++++++++++++--- contrib/postgres_fdw/postgres_fdw.h | 2 + contrib/postgres_fdw/sql/postgres_fdw.sql | 20 +- 5 files changed, 628 insertions(+), 139 deletions(-) diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index fe4893a8e0..da7c826e4f 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -58,6 +58,7 @@ typedef struct ConnCacheEntry bool invalidated; /* true if reconnect is pending */ uint32 server_hashvalue; /* hash value of foreign server OID */ uint32 mapping_hashvalue; /* hash value of user mapping OID */ + void *storage; /* connection specific storage */ } ConnCacheEntry; /* @@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt) elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)", entry->conn, server->servername, user->umid, user->userid); + entry->storage = NULL; } /* @@ -215,6 +217,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt) return entry->conn; } +/* + * Returns the connection-specific storage for this user. Allocates it with + * size initsize if it doesn't exist yet. + */ +void * +GetConnectionSpecificStorage(UserMapping *user, size_t initsize) +{ + bool found; + ConnCacheEntry *entry; + ConnCacheKey key; + + key = user->umid; + entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found); + Assert(found); + + if (entry->storage == NULL) + { + entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize); + memset(entry->storage, 0, initsize); + } + + return entry->storage; +} + /* * Connect to remote server using specified server and user mapping properties.
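The point of the connection-specific storage is that every scan multiplexed onto one cached connection needs a single shared place for the leader/busy bookkeeping. A caller obtains it in one step, as the scan and direct-modify paths later in this patch do:

    fsstate->s.conn = GetConnection(user, false);
    fsstate->s.connpriv = (PgFdwConnpriv *)
        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));

Because the block is zeroed at first allocation and lives in CacheMemoryContext, the state survives for the life of the cached connection rather than for a single query.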
*/ diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 248aa73c0b..bb6b1a8fdf 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -6968,13 +6968,12 @@ select * from bar where f1 in (select f1 from foo) for update; Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - Async subplans: 1 - -> Async Foreign Scan on public.foo2 - Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 - Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -> Seq Scan on public.foo Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -(29 rows) + -> Foreign Scan on public.foo2 + Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 + Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 +(23 rows) select * from bar where f1 in (select f1 from foo) for update; f1 | f2 @@ -7007,13 +7006,12 @@ select * from bar where f1 in (select f1 from foo) for share; Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - Async subplans: 1 - -> Async Foreign Scan on public.foo2 - Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 - Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -> Seq Scan on public.foo Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -(29 rows) + -> Foreign Scan on public.foo2 + Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 + Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 +(23 rows) select * from bar where f1 in (select f1 from foo) for share; f1 | f2 @@ -7045,8 +7043,9 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - Async subplans: 1 - -> Async Foreign Scan on public.foo2 + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + -> Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -> Hash Join @@ -7062,13 +7061,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - Async subplans: 1 - -> Async Foreign Scan on public.foo2 - Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 - Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -> Seq Scan on public.foo Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -(41 rows) + -> Foreign Scan on public.foo2 + Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 + Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 +(39 rows) update bar set f2 = f2 + 100 where f1 in (select f1 from foo); select tableoid::regclass, * from bar order by 1,2; @@ -7098,11 +7096,14 @@ where bar.f1 = ss.f1; Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1)) Hash Cond: (foo.f1 = bar.f1) -> Append - Async subplans: 2 - -> Async Foreign Scan on public.foo2 + -> Seq Scan on public.foo + Output: ROW(foo.f1), foo.f1 + -> Foreign Scan on public.foo2 Output: ROW(foo2.f1), foo2.f1 Remote SQL: SELECT f1 FROM public.loct1 - -> Async Foreign Scan on public.foo2 foo2_1 + -> Seq Scan on public.foo foo_1 + Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) + -> Foreign Scan on public.foo2 foo2_1 Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) Remote SQL: SELECT f1 FROM public.loct1 -> Hash @@ -7122,18 +7123,17 @@ where bar.f1 = ss.f1; Output: (ROW(foo.f1)), foo.f1 Sort Key: foo.f1 -> Append - Async subplans: 2 - -> Async Foreign Scan on public.foo2 - Output: ROW(foo2.f1), foo2.f1 - Remote SQL: SELECT f1 FROM public.loct1 - -> Async Foreign Scan on public.foo2 foo2_1 - Output: 
ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) - Remote SQL: SELECT f1 FROM public.loct1 -> Seq Scan on public.foo Output: ROW(foo.f1), foo.f1 + -> Foreign Scan on public.foo2 + Output: ROW(foo2.f1), foo2.f1 + Remote SQL: SELECT f1 FROM public.loct1 -> Seq Scan on public.foo foo_1 Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) -(47 rows) + -> Foreign Scan on public.foo2 foo2_1 + Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) + Remote SQL: SELECT f1 FROM public.loct1 +(45 rows) update bar set f2 = f2 + 100 from @@ -8155,12 +8155,11 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J Sort Sort Key: t1.a, t3.c -> Append - Async subplans: 2 - -> Async Foreign Scan + -> Foreign Scan Relations: ((public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)) INNER JOIN (public.ftprt1_p1 t3) - -> Async Foreign Scan + -> Foreign Scan Relations: ((public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)) INNER JOIN (public.ftprt1_p2 t3) -(8 rows) +(7 rows) SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE t1.a% 25 =0 ORDER BY 1,2,3; a | b | c @@ -8179,10 +8178,9 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) Sort Sort Key: t1.a, ftprt2_p1.b, ftprt2_p1.c -> Append - Async subplans: 1 - -> Async Foreign Scan + -> Foreign Scan Relations: (public.ftprt1_p1 t1) LEFT JOIN (public.ftprt2_p1 fprt2) -(6 rows) +(5 rows) SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) t2 ON (t1.a = t2.b and t1.b = t2.a) WHEREt1.a < 10 ORDER BY 1,2,3; a | b | c @@ -8202,12 +8200,11 @@ SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE Sort Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2) -> Append - Async subplans: 2 - -> Async Foreign Scan + -> Foreign Scan Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2) - -> Async Foreign Scan + -> Foreign Scan Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2) -(8 rows) +(7 rows) SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a % 25 =0 ORDER BY 1,2; t1 | t2 @@ -8226,12 +8223,11 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t Sort Sort Key: t1.a, t1.b -> Append - Async subplans: 2 - -> Async Foreign Scan + -> Foreign Scan Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2) - -> Async Foreign Scan + -> Foreign Scan Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2) -(8 rows) +(7 rows) SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25= 0 ORDER BY 1,2; a | b @@ -8313,11 +8309,10 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O Group Key: fpagg_tab_p1.a Filter: (avg(fpagg_tab_p1.b) < '22'::numeric) -> Append - Async subplans: 3 - -> Async Foreign Scan on fpagg_tab_p1 - -> Async Foreign Scan on fpagg_tab_p2 - -> Async Foreign Scan on fpagg_tab_p3 -(10 rows) + -> Foreign Scan on fpagg_tab_p1 + -> Foreign Scan on fpagg_tab_p2 + -> Foreign Scan on fpagg_tab_p3 +(9 rows) -- Plan with partitionwise aggregates is enabled SET enable_partitionwise_aggregate TO true; @@ -8328,14 +8323,13 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O Sort Sort Key: fpagg_tab_p1.a -> Append - Async subplans: 3 - -> Async Foreign Scan + -> Foreign Scan Relations: Aggregate on (public.fpagg_tab_p1 pagg_tab) - -> Async Foreign Scan + -> Foreign Scan Relations: Aggregate on 
(public.fpagg_tab_p2 pagg_tab) - -> Async Foreign Scan + -> Foreign Scan Relations: Aggregate on (public.fpagg_tab_p3 pagg_tab) -(10 rows) +(9 rows) SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1; a | sum | min | count diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 78b0f43ca8..8efbbf95a8 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -20,6 +20,8 @@ #include "commands/defrem.h" #include "commands/explain.h" #include "commands/vacuum.h" +#include "executor/execAsync.h" +#include "executor/nodeForeignscan.h" #include "foreign/fdwapi.h" #include "funcapi.h" #include "miscadmin.h" @@ -34,6 +36,7 @@ #include "optimizer/var.h" #include "optimizer/tlist.h" #include "parser/parsetree.h" +#include "pgstat.h" #include "utils/builtins.h" #include "utils/guc.h" #include "utils/lsyscache.h" @@ -53,6 +56,9 @@ PG_MODULE_MAGIC; /* If no remote estimates, assume a sort costs 20% extra */ #define DEFAULT_FDW_SORT_MULTIPLIER 1.2 +/* Retrieve PgFdwScanState struct from ForeignScanState */ +#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state) + /* * Indexes of FDW-private information stored in fdw_private lists. * @@ -119,11 +125,28 @@ enum FdwDirectModifyPrivateIndex FdwDirectModifyPrivateSetProcessed }; +/* + * Connection private area structure. + */ +typedef struct PgFdwConnpriv +{ + ForeignScanState *leader; /* leader node of this connection */ + bool busy; /* true if this connection is busy */ +} PgFdwConnpriv; + +/* Execution state base type */ +typedef struct PgFdwState +{ + PGconn *conn; /* connection for the scan */ + PgFdwConnpriv *connpriv; /* connection private memory */ +} PgFdwState; + /* * Execution state of a foreign scan using postgres_fdw. */ typedef struct PgFdwScanState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table. NULL * for a foreign join scan. */ TupleDesc tupdesc; /* tuple descriptor of scan */ @@ -134,7 +157,7 @@ typedef struct PgFdwScanState List *retrieved_attrs; /* list of retrieved attribute numbers */ /* for remote query execution */ - PGconn *conn; /* connection for the scan */ + bool result_ready; unsigned int cursor_number; /* quasi-unique ID for my cursor */ bool cursor_exists; /* have we created the cursor? */ int numParams; /* number of parameters passed to query */ @@ -150,6 +173,12 @@ typedef struct PgFdwScanState /* batch-level state, for optimizing rewinds and avoiding useless fetch */ int fetch_ct_2; /* Min(# of fetches done, 2) */ bool eof_reached; /* true if last fetch reached EOF */ + bool run_async; /* true if run asynchronously */ + bool inqueue; /* true if this node is in waiter queue */ + ForeignScanState *waiter; /* Next node to run a query among nodes + * sharing the same connection */ + ForeignScanState *last_waiter; /* last waiting node in waiting queue.
+ * valid only on the leader node */ /* working memory contexts */ MemoryContext batch_cxt; /* context holding current batch of tuples */ @@ -163,11 +192,11 @@ typedef struct PgFdwScanState */ typedef struct PgFdwModifyState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table */ AttInMetadata *attinmeta; /* attribute datatype conversion metadata */ /* for remote query execution */ - PGconn *conn; /* connection for the scan */ char *p_name; /* name of prepared statement, if created */ /* extracted fdw_private data */ @@ -190,6 +219,7 @@ typedef struct PgFdwModifyState */ typedef struct PgFdwDirectModifyState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table */ AttInMetadata *attinmeta; /* attribute datatype conversion metadata */ @@ -293,6 +323,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags); static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node); static void postgresReScanForeignScan(ForeignScanState *node); static void postgresEndForeignScan(ForeignScanState *node); +static void postgresShutdownForeignScan(ForeignScanState *node); static void postgresAddForeignUpdateTargets(Query *parsetree, RangeTblEntry *target_rte, Relation target_relation); @@ -358,6 +389,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root, RelOptInfo *input_rel, RelOptInfo *output_rel, void *extra); +static bool postgresIsForeignPathAsyncCapable(ForeignPath *path); +static bool postgresForeignAsyncConfigureWait(ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, bool reinit); /* * Helper functions @@ -378,7 +413,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel, EquivalenceClass *ec, EquivalenceMember *em, void *arg); static void create_cursor(ForeignScanState *node); -static void fetch_more_data(ForeignScanState *node); +static void request_more_data(ForeignScanState *node); +static void fetch_received_data(ForeignScanState *node); +static void vacate_connection(PgFdwState *fdwconn, bool clear_queue); static void close_cursor(PGconn *conn, unsigned int cursor_number); static PgFdwModifyState *create_foreign_modify(EState *estate, RangeTblEntry *rte, @@ -469,6 +506,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS) routine->IterateForeignScan = postgresIterateForeignScan; routine->ReScanForeignScan = postgresReScanForeignScan; routine->EndForeignScan = postgresEndForeignScan; + routine->ShutdownForeignScan = postgresShutdownForeignScan; /* Functions for updating foreign tables */ routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets; @@ -505,6 +543,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS) /* Support functions for upper relation push-down */ routine->GetForeignUpperPaths = postgresGetForeignUpperPaths; + /* Support functions for async execution */ + routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable; + routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait; + PG_RETURN_POINTER(routine); } @@ -1355,12 +1397,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags) * Get connection to the foreign server. Connection manager will * establish new connection if necessary. 
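A concrete walk-through of the waiter queue that the helper functions below implement: suppose scans A, B and C share one connection, and A sends its FETCH first, so connpriv->leader = A and busy is set. If B and C then want data while A's query is in flight, add_async_waiter() chains them behind the leader (A's waiter is B, B's waiter is C, and A's last_waiter points at C). Once A's pending result has been absorbed, move_to_next_waiter() promotes B to leader so that B's request can go out on the now-idle connection, and C moves up in turn. remove_async_node() exists so a node being shut down can unlink itself from anywhere in that chain.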
*/ - fsstate->conn = GetConnection(user, false); + fsstate->s.conn = GetConnection(user, false); + fsstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); + fsstate->s.connpriv->leader = NULL; + fsstate->s.connpriv->busy = false; + fsstate->waiter = NULL; + fsstate->last_waiter = node; /* Assign a unique ID for my cursor */ - fsstate->cursor_number = GetCursorNumber(fsstate->conn); + fsstate->cursor_number = GetCursorNumber(fsstate->s.conn); fsstate->cursor_exists = false; + /* Initialize async execution status */ + fsstate->run_async = false; + fsstate->inqueue = false; + /* Get private info created by planner functions. */ fsstate->query = strVal(list_nth(fsplan->fdw_private, FdwScanPrivateSelectSql)); @@ -1408,40 +1460,250 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags) &fsstate->param_values); } +/* + * Async queue manipulation functions + */ + +/* + * add_async_waiter: + * + * adds the node to the end of the waiter queue + */ +static inline void +add_async_waiter(ForeignScanState *node) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); + ForeignScanState *leader = fsstate->s.connpriv->leader; + PgFdwScanState *leader_state; + PgFdwScanState *last_waiter_state; + + Assert(leader && leader != node); + + /* do nothing if the node is already in the queue */ + if (fsstate->inqueue) + return; + + leader_state = GetPgFdwScanState(leader); + last_waiter_state = GetPgFdwScanState(leader_state->last_waiter); + last_waiter_state->waiter = node; + leader_state->last_waiter = node; + fsstate->inqueue = true; +} + +/* + * move_to_next_waiter: + * + * Makes the first waiter the next leader. + * Returns the new leader or NULL if there's no waiter. + */ +static inline ForeignScanState * +move_to_next_waiter(ForeignScanState *node) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); + ForeignScanState *ret = fsstate->waiter; + + Assert(fsstate->s.connpriv->leader == node); + + if (ret) + { + PgFdwScanState *retstate = GetPgFdwScanState(ret); + fsstate->waiter = NULL; + retstate->last_waiter = fsstate->last_waiter; + retstate->inqueue = false; + } + + fsstate->s.connpriv->leader = ret; + + return ret; +} + +/* + * removes the node from the waiter queue + * + * This is a bit different from the two above in that it can operate on the + * connection leader. Any pending result is absorbed when this is called on + * the active leader. + * + * Returns true if the node was found. + */ +static inline bool +remove_async_node(ForeignScanState *node) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); + ForeignScanState *leader = fsstate->s.connpriv->leader; + PgFdwScanState *leader_state; + ForeignScanState *prev; + PgFdwScanState *prev_state; + ForeignScanState *cur; + + /* no need to remove me */ + if (!leader || !fsstate->inqueue) + return false; + + leader_state = GetPgFdwScanState(leader); + + /* Remove the leader node */ + if (leader == node) + { + ForeignScanState *next_leader; + + if (leader_state->s.connpriv->busy) + { + /* + * this node is waiting for a result; absorb the result first so + * that the following commands can be sent on the connection.
+ */ + PgFdwScanState *leader_state = GetPgFdwScanState(leader); + PGconn *conn = leader_state->s.conn; + + while (PQisBusy(conn)) + PQclear(PQgetResult(conn)); + + leader_state->s.connpriv->busy = false; + } + + /* Make the first waiter the leader */ + if (leader_state->waiter) + { + PgFdwScanState *next_leader_state; + + next_leader = leader_state->waiter; + next_leader_state = GetPgFdwScanState(next_leader); + + leader_state->s.connpriv->leader = next_leader; + next_leader_state->last_waiter = leader_state->last_waiter; + } + leader_state->waiter = NULL; + + return true; + } + + /* + * Just remove the node from the queue + * + * This function is called on the shutdown path. We don't bother + * considering a faster way to do this. + */ + prev = leader; + prev_state = leader_state; + cur = GetPgFdwScanState(prev)->waiter; + while (cur) + { + PgFdwScanState *curstate = GetPgFdwScanState(cur); + + if (cur == node) + { + prev_state->waiter = curstate->waiter; + if (leader_state->last_waiter == cur) + leader_state->last_waiter = prev; + + fsstate->inqueue = false; + + return true; + } + prev = cur; + prev_state = curstate; + cur = curstate->waiter; + } + + return false; +} + /* * postgresIterateForeignScan - * Retrieve next row from the result set, or clear tuple slot to indicate - * EOF. + * Retrieve next row from the result set. + * + * For synchronous nodes, returns a cleared tuple slot to indicate EOF. + * + * If the node is an asynchronous one, a cleared tuple slot has two meanings. + * If the caller receives a cleared tuple slot, asyncstate indicates whether + * the node has reached EOF (AS_AVAILABLE) or is waiting for data to + * come (AS_WAITING). */ static TupleTableSlot * postgresIterateForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); TupleTableSlot *slot = node->ss.ss_ScanTupleSlot; - /* - * If this is the first call after Begin or ReScan, we need to create the - * cursor on the remote side. - */ - if (!fsstate->cursor_exists) - create_cursor(node); + if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached) + { + /* we've run out, get some more tuples */ + if (!node->fs_async) + { + /* finish running query to send my command */ + if (!fsstate->s.connpriv->busy) + vacate_connection((PgFdwState *)fsstate, false); + + request_more_data(node); + + /* + * Fetch the result immediately. This executes the next waiter if + * any. + */ + fetch_received_data(node); + } + else if (!fsstate->s.connpriv->busy) + { + /* If the connection is not busy, just send the request. */ + request_more_data(node); + } + else if (fsstate->s.connpriv->leader == node) + { + bool available = true; + + /* Check if the result is available */ + if (PQisBusy(fsstate->s.conn)) + { + int rc = WaitLatchOrSocket(NULL, + WL_SOCKET_READABLE | WL_TIMEOUT, + PQsocket(fsstate->s.conn), 0, + WAIT_EVENT_ASYNC_WAIT); + if (!(rc & WL_SOCKET_READABLE)) + available = false; + } + + /* The next waiter is executed automatically */ + if (available) + fetch_received_data(node); + } + else if (fsstate->s.connpriv->leader) + { + /* + * Someone else is the leader of this connection, so add this + * node to the waiting queue. + */ + add_async_waiter(node); + } + } /* - * Get some more tuples, if we've run out. + * If we haven't received a result for the given node this time, + * return with no tuple to give way to another node.
*/ if (fsstate->next_tuple >= fsstate->num_tuples) { - /* No point in another fetch if we already detected EOF, though. */ - if (!fsstate->eof_reached) - fetch_more_data(node); - /* If we didn't get any tuples, must be end of data. */ - if (fsstate->next_tuple >= fsstate->num_tuples) - return ExecClearTuple(slot); + if (fsstate->eof_reached) + { + fsstate->result_ready = true; + node->ss.ps.asyncstate = AS_AVAILABLE; + } + else + { + fsstate->result_ready = false; + node->ss.ps.asyncstate = AS_WAITING; + } + + return ExecClearTuple(slot); } /* * Return the next tuple. */ + fsstate->result_ready = true; + node->ss.ps.asyncstate = AS_AVAILABLE; ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++], slot, InvalidBuffer, @@ -1457,7 +1719,7 @@ postgresIterateForeignScan(ForeignScanState *node) static void postgresReScanForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); char sql[64]; PGresult *res; @@ -1465,6 +1727,8 @@ postgresReScanForeignScan(ForeignScanState *node) if (!fsstate->cursor_exists) return; + vacate_connection((PgFdwState *)fsstate, true); + /* * If any internal parameters affecting this node have changed, we'd * better destroy and recreate the cursor. Otherwise, rewinding it should @@ -1493,9 +1757,9 @@ postgresReScanForeignScan(ForeignScanState *node) * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_exec_query(fsstate->conn, sql); + res = pgfdw_exec_query(fsstate->s.conn, sql); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fsstate->conn, true, sql); + pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql); PQclear(res); /* Now force a fresh FETCH. */ @@ -1513,7 +1777,7 @@ postgresReScanForeignScan(ForeignScanState *node) static void postgresEndForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); /* if fsstate is NULL, we are in EXPLAIN; nothing to do */ if (fsstate == NULL) @@ -1521,15 +1785,31 @@ postgresEndForeignScan(ForeignScanState *node) /* Close the cursor if open, to prevent accumulation of cursors */ if (fsstate->cursor_exists) - close_cursor(fsstate->conn, fsstate->cursor_number); + close_cursor(fsstate->s.conn, fsstate->cursor_number); /* Release remote connection */ - ReleaseConnection(fsstate->conn); - fsstate->conn = NULL; + ReleaseConnection(fsstate->s.conn); + fsstate->s.conn = NULL; /* MemoryContexts will be deleted automatically. */ } +/* + * postgresShutdownForeignScan + * Remove asynchrony stuff and clean up garbage on the connection. + */ +static void +postgresShutdownForeignScan(ForeignScanState *node) +{ + ForeignScan *plan = (ForeignScan *) node->ss.ps.plan; + + if (plan->operation != CMD_SELECT) + return; + + /* remove the node from the waiting queue */ + remove_async_node(node); +} + /* * postgresAddForeignUpdateTargets * Add resjunk column(s) needed for update/delete on a foreign table @@ -1753,6 +2033,9 @@ postgresExecForeignInsert(EState *estate, PGresult *res; int n_rows; + /* finish running query to send my command */ + vacate_connection((PgFdwState *)fmstate, true); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1763,14 +2046,14 @@ postgresExecForeignInsert(EState *estate, /* * Execute the prepared statement.
*/ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -1778,10 +2061,10 @@ postgresExecForeignInsert(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -1819,6 +2102,9 @@ postgresExecForeignUpdate(EState *estate, PGresult *res; int n_rows; + /* finish running query to send my command */ + vacate_connection((PgFdwState *)fmstate, true); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1839,14 +2125,14 @@ postgresExecForeignUpdate(EState *estate, /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -1854,10 +2140,10 @@ postgresExecForeignUpdate(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -1895,6 +2181,9 @@ postgresExecForeignDelete(EState *estate, PGresult *res; int n_rows; + /* finish running query to send my command */ + vacate_connection((PgFdwState *)fmstate, true); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1915,14 +2204,14 @@ postgresExecForeignDelete(EState *estate, /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -1930,10 +2219,10 @@ postgresExecForeignDelete(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ? 
PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -2400,7 +2689,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags) * Get connection to the foreign server. Connection manager will * establish new connection if necessary. */ - dmstate->conn = GetConnection(user, false); + dmstate->s.conn = GetConnection(user, false); + dmstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); /* Update the foreign-join-related fields. */ if (fsplan->scan.scanrelid == 0) @@ -2485,7 +2776,11 @@ postgresIterateDirectModify(ForeignScanState *node) * If this is the first call after Begin, execute the statement. */ if (dmstate->num_tuples == -1) + { + /* finish running query to send my command */ + vacate_connection((PgFdwState *)dmstate, true); execute_dml_stmt(node); + } /* * If the local query doesn't specify RETURNING, just clear tuple slot. @@ -2532,8 +2827,8 @@ postgresEndDirectModify(ForeignScanState *node) PQclear(dmstate->result); /* Release remote connection */ - ReleaseConnection(dmstate->conn); - dmstate->conn = NULL; + ReleaseConnection(dmstate->s.conn); + dmstate->s.conn = NULL; /* close the target relation. */ if (dmstate->resultRel) @@ -2656,6 +2951,7 @@ estimate_path_cost_size(PlannerInfo *root, List *local_param_join_conds; StringInfoData sql; PGconn *conn; + PgFdwConnpriv *connpriv; Selectivity local_sel; QualCost local_cost; List *fdw_scan_tlist = NIL; @@ -2698,6 +2994,18 @@ estimate_path_cost_size(PlannerInfo *root, /* Get the remote estimate */ conn = GetConnection(fpinfo->user, false); + connpriv = GetConnectionSpecificStorage(fpinfo->user, + sizeof(PgFdwConnpriv)); + if (connpriv) + { + PgFdwState tmpstate; + tmpstate.conn = conn; + tmpstate.connpriv = connpriv; + + /* finish running query to send my command */ + vacate_connection(&tmpstate, true); + } + get_remote_estimate(sql.data, conn, &rows, &width, &startup_cost, &total_cost); ReleaseConnection(conn); @@ -3061,11 +3369,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel, static void create_cursor(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); ExprContext *econtext = node->ss.ps.ps_ExprContext; int numParams = fsstate->numParams; const char **values = fsstate->param_values; - PGconn *conn = fsstate->conn; + PGconn *conn = fsstate->s.conn; StringInfoData buf; PGresult *res; @@ -3128,50 +3436,121 @@ create_cursor(ForeignScanState *node) } /* - * Fetch some more rows from the node's cursor. + * Sends the next request of the node. If the given node is different from the + * current connection leader, pushes it back to waiter queue and let the given + * node be the leader. */ static void -fetch_more_data(ForeignScanState *node) +request_more_data(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); + ForeignScanState *leader = fsstate->s.connpriv->leader; + PGconn *conn = fsstate->s.conn; + char sql[64]; + + /* must be non-busy */ + Assert(!fsstate->s.connpriv->busy); + /* must be not-eof */ + Assert(!fsstate->eof_reached); + + /* + * If this is the first call after Begin or ReScan, we need to create the + * cursor on the remote side. 
+ */ + if (!fsstate->cursor_exists) + create_cursor(node); + + snprintf(sql, sizeof(sql), "FETCH %d FROM c%u", + fsstate->fetch_size, fsstate->cursor_number); + + if (!PQsendQuery(conn, sql)) + pgfdw_report_error(ERROR, NULL, conn, false, sql); + + fsstate->s.connpriv->busy = true; + + /* Let the node be the leader if it is different from current one */ + if (leader != node) + { + /* + * If the connection leader exists, insert the node as the connection + * leader making the current leader be the first waiter. + */ + if (leader != NULL) + { + remove_async_node(node); + fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter; + fsstate->waiter = leader; + } + fsstate->s.connpriv->leader = node; + } +} + +/* + * Fetches received data and automatically send requests of the next waiter. + */ +static void +fetch_received_data(ForeignScanState *node) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); PGresult *volatile res = NULL; MemoryContext oldcontext; + ForeignScanState *waiter; + + /* I should be the current connection leader */ + Assert(fsstate->s.connpriv->leader == node); /* * We'll store the tuples in the batch_cxt. First, flush the previous - * batch. + * batch if no tuple is remaining */ - fsstate->tuples = NULL; - MemoryContextReset(fsstate->batch_cxt); + if (fsstate->next_tuple >= fsstate->num_tuples) + { + fsstate->tuples = NULL; + fsstate->num_tuples = 0; + MemoryContextReset(fsstate->batch_cxt); + } + else if (fsstate->next_tuple > 0) + { + /* move the remaining tuples to the beginning of the store */ + int n = 0; + + while(fsstate->next_tuple < fsstate->num_tuples) + fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++]; + fsstate->num_tuples = n; + } + oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt); /* PGresult must be released before leaving this function. */ PG_TRY(); { - PGconn *conn = fsstate->conn; + PGconn *conn = fsstate->s.conn; char sql[64]; - int numrows; + int addrows; + size_t newsize; int i; snprintf(sql, sizeof(sql), "FETCH %d FROM c%u", fsstate->fetch_size, fsstate->cursor_number); - res = pgfdw_exec_query(conn, sql); + res = pgfdw_get_result(conn, sql); /* On error, report the original query, not the FETCH. */ if (PQresultStatus(res) != PGRES_TUPLES_OK) pgfdw_report_error(ERROR, res, conn, false, fsstate->query); /* Convert the data into HeapTuples */ - numrows = PQntuples(res); - fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple)); - fsstate->num_tuples = numrows; - fsstate->next_tuple = 0; + addrows = PQntuples(res); + newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple); + if (fsstate->tuples) + fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize); + else + fsstate->tuples = (HeapTuple *) palloc(newsize); - for (i = 0; i < numrows; i++) + for (i = 0; i < addrows; i++) { Assert(IsA(node->ss.ps.plan, ForeignScan)); - fsstate->tuples[i] = + fsstate->tuples[fsstate->num_tuples + i] = make_tuple_from_result_row(res, i, fsstate->rel, fsstate->attinmeta, @@ -3181,26 +3560,76 @@ fetch_more_data(ForeignScanState *node) } /* Update fetch_ct_2 */ - if (fsstate->fetch_ct_2 < 2) + if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0) fsstate->fetch_ct_2++; + fsstate->next_tuple = 0; + fsstate->num_tuples += addrows; + /* Must be EOF if we didn't get as many tuples as we asked for. 
*/ - fsstate->eof_reached = (numrows < fsstate->fetch_size); + fsstate->eof_reached = (addrows < fsstate->fetch_size); PQclear(res); res = NULL; } PG_CATCH(); { + fsstate->s.connpriv->busy = false; + if (res) PQclear(res); PG_RE_THROW(); } PG_END_TRY(); + fsstate->s.connpriv->busy = false; + + /* let the first waiter be the next leader of this connection */ + waiter = move_to_next_waiter(node); + + /* send the next request if any */ + if (waiter) + request_more_data(waiter); + MemoryContextSwitchTo(oldcontext); } +/* + * Vacate a connection so that this node can send the next query + */ +static void +vacate_connection(PgFdwState *fdwstate, bool clear_queue) +{ + PgFdwConnpriv *connpriv = fdwstate->connpriv; + ForeignScanState *leader; + + /* the connection is alrady available */ + if (connpriv == NULL || connpriv->leader == NULL || !connpriv->busy) + return; + + /* + * let the current connection leader read the result for the running query + */ + leader = connpriv->leader; + fetch_received_data(leader); + + /* let the first waiter be the next leader of this connection */ + move_to_next_waiter(leader); + + if (!clear_queue) + return; + + /* Clear the waiting list */ + while (leader) + { + PgFdwScanState *fsstate = GetPgFdwScanState(leader); + + fsstate->last_waiter = NULL; + leader = fsstate->waiter; + fsstate->waiter = NULL; + } +} + /* * Force assorted GUC parameters to settings that ensure that we'll output * data values in a form that is unambiguous to the remote server. @@ -3314,7 +3743,9 @@ create_foreign_modify(EState *estate, user = GetUserMapping(userid, table->serverid); /* Open connection; report that we'll create a prepared statement. */ - fmstate->conn = GetConnection(user, true); + fmstate->s.conn = GetConnection(user, true); + fmstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); fmstate->p_name = NULL; /* prepared statement not made yet */ /* Set up remote query information. */ @@ -3387,7 +3818,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) /* Construct name we'll use for the prepared statement. */ snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u", - GetPrepStmtNumber(fmstate->conn)); + GetPrepStmtNumber(fmstate->s.conn)); p_name = pstrdup(prep_name); /* @@ -3397,12 +3828,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) * the prepared statements we use in this module are simple enough that * the remote server will make the right choices. */ - if (!PQsendPrepare(fmstate->conn, + if (!PQsendPrepare(fmstate->s.conn, p_name, fmstate->query, 0, NULL)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -3410,9 +3841,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); PQclear(res); /* This action shows that the prepare has been done. */ @@ -3537,16 +3968,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate) * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. 
*/ - res = pgfdw_exec_query(fmstate->conn, sql); + res = pgfdw_exec_query(fmstate->s.conn, sql); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fmstate->conn, true, sql); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql); PQclear(res); fmstate->p_name = NULL; } /* Release remote connection */ - ReleaseConnection(fmstate->conn); - fmstate->conn = NULL; + ReleaseConnection(fmstate->s.conn); + fmstate->s.conn = NULL; } /* @@ -3706,9 +4137,9 @@ execute_dml_stmt(ForeignScanState *node) * the desired result. This allows us to avoid assuming that the remote * server has the same OIDs we do for the parameters' types. */ - if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams, + if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams, NULL, values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query); + pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query); /* * Get the result, and check for success. @@ -3716,10 +4147,10 @@ execute_dml_stmt(ForeignScanState *node) * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query); + dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query); if (PQresultStatus(dmstate->result) != (dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true, + pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true, dmstate->query); /* Get the number of rows affected. */ @@ -5203,6 +5634,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root, /* XXX Consider parameterized paths for the join relation */ } +static bool +postgresIsForeignPathAsyncCapable(ForeignPath *path) +{ + return true; +} + + +/* + * Configure waiting event. + * + * Add an wait event only when the node is the connection leader. Elsewise + * another node on this connection is the leader. + */ +static bool +postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes, + void *caller_data, bool reinit) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); + + + /* If the caller didn't reinit, this event is already in event set */ + if (!reinit) + return true; + + if (fsstate->s.connpriv->leader == node) + { + AddWaitEventToSet(wes, + WL_SOCKET_READABLE, PQsocket(fsstate->s.conn), + NULL, caller_data); + return true; + } + + return false; +} + + /* * Assess whether the aggregation, grouping and having operations can be pushed * down to the foreign server. As a side effect, save information we obtain in diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h index a5d4011e8d..f344fb7f66 100644 --- a/contrib/postgres_fdw/postgres_fdw.h +++ b/contrib/postgres_fdw/postgres_fdw.h @@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo UserMapping *user; /* only set in use_remote_estimate mode */ int fetch_size; /* fetch size for this remote table */ + bool allow_prefetch; /* true to allow overlapped fetching */ /* * Name of the relation while EXPLAINing ForeignScan. 
It is used for join @@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel); /* in connection.c */ extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt); +void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize); extern void ReleaseConnection(PGconn *conn); extern unsigned int GetCursorNumber(PGconn *conn); extern unsigned int GetPrepStmtNumber(PGconn *conn); diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql index 231b1e01a5..8ecc903c20 100644 --- a/contrib/postgres_fdw/sql/postgres_fdw.sql +++ b/contrib/postgres_fdw/sql/postgres_fdw.sql @@ -1617,25 +1617,25 @@ INSERT INTO b(aa) VALUES('bbb'); INSERT INTO b(aa) VALUES('bbbb'); INSERT INTO b(aa) VALUES('bbbbb'); -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; UPDATE b SET aa = 'new'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; UPDATE a SET aa = 'newtoo'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; SELECT tableoid::regclass, * FROM b; SELECT tableoid::regclass, * FROM ONLY a; @@ -1677,12 +1677,12 @@ insert into bar2 values(4,44,44); insert into bar2 values(7,77,77); explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for update; -select * from bar where f1 in (select f1 from foo) for update; +select * from bar where f1 in (select f1 from foo) order by 1 for update; +select * from bar where f1 in (select f1 from foo) order by 1 for update; explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for share; -select * from bar where f1 in (select f1 from foo) for share; +select * from bar where f1 in (select f1 from foo) order by 1 for share; +select * from bar where f1 in (select f1 from foo) order by 1 for share; -- Check UPDATE with inherited target and an inherited source table explain (verbose, costs off) @@ -1741,8 +1741,8 @@ explain (verbose, costs off) delete from foo where f1 < 5 returning *; delete from foo where f1 < 5 returning *; explain (verbose, costs off) -update bar set f2 = f2 + 100 returning *; -update bar set f2 = f2 + 100 returning *; +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; -- Test that UPDATE/DELETE with inherited target works with row-level triggers CREATE TRIGGER trig_row_before -- 2.16.3
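To make the connection-sharing discipline in the postgres_fdw changes above
easier to follow, here is a minimal standalone model of the leader/waiter
queue. All names in it (ScanNode, ConnPriv, add_waiter, promote_next_waiter)
are illustrative stand-ins rather than the patch's own symbols; in the patch
the state lives in PgFdwScanState and PgFdwConnpriv, and the corresponding
routines are add_async_waiter() and move_to_next_waiter().

    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef struct ScanNode ScanNode;

    typedef struct ConnPriv
    {
        ScanNode   *leader;      /* scan whose query is in flight, if any */
        int         busy;        /* nonzero while a query is outstanding */
    } ConnPriv;

    struct ScanNode
    {
        const char *name;
        ConnPriv   *conn;        /* shared per-connection state */
        ScanNode   *waiter;      /* next scan waiting for this connection */
        ScanNode   *last_waiter; /* queue tail; only valid on the leader */
    };

    /* Queue a scan behind the current leader, as add_async_waiter() does. */
    static void
    add_waiter(ScanNode *node)
    {
        ScanNode   *leader = node->conn->leader;

        assert(leader != NULL && leader != node);
        if (leader->waiter == NULL)
            leader->waiter = node;
        else
            leader->last_waiter->waiter = node;
        leader->last_waiter = node;
    }

    /* Hand the connection to the first waiter, as move_to_next_waiter() does. */
    static ScanNode *
    promote_next_waiter(ScanNode *leader)
    {
        ScanNode   *next = leader->waiter;

        assert(leader->conn->leader == leader);
        if (next != NULL)
        {
            /* the new leader inherits the queue tail */
            next->last_waiter = leader->last_waiter;
            leader->waiter = NULL;
        }
        leader->conn->leader = next;
        return next;
    }

    int
    main(void)
    {
        ConnPriv    conn = {NULL, 0};
        ScanNode    a = {"a", &conn, NULL, NULL};
        ScanNode    b = {"b", &conn, NULL, NULL};
        ScanNode    c = {"c", &conn, NULL, NULL};

        /* a sends a FETCH first and becomes the leader of the connection */
        conn.leader = &a;
        conn.busy = 1;

        /* b and c need the same connection, so they queue behind a */
        add_waiter(&b);
        add_waiter(&c);

        /* a's result arrives; the connection passes to b, then to c */
        for (ScanNode *n = promote_next_waiter(&a); n != NULL;
             n = promote_next_waiter(n))
            printf("%s is now the leader\n", n->name);

        return 0;
    }

The tail pointer is what keeps enqueueing O(1) however many scans pile up
behind one connection; only remove_async_node(), on the shutdown path, walks
the whole list.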
This got further refactoring.

At Fri, 11 May 2018 17:45:20 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180511.174520.188681124.horiguchi.kyotaro@lab.ntt.co.jp>
> But this is not just a rebased version. While fixing the serious
> conflicts, I refactored the patch, and I believe it is now far more
> readable than its previous shape.
>
> - Waiting queue manipulation is moved into new functions. It had a bug
>   where the same node could be inserted in the queue more than once;
>   that is fixed.
>
> - postgresIterateForeignScan had a somewhat tricky structure to merge
>   similar procedures, so it could hardly be called easy to read. Now it
>   is far simpler and more straightforward.
>
> - This still works only on Append/ForeignScan.

I performed almost the same tests (again) as before, with some new
things:

- partition tables (There should be no difference from inheritance, and
  indeed there appears to be none.)

- added tests for fetch_size of 200 and 1000 as well as 100.

A fetch size of 100 seems to magnify the lag caused by context switching
unreasonably on the single, underpowered box used for tests D/F below.
They became about twice as fast by *adding* a small delay (1000 calls of
clock_gettime() (*1)) just before epoll_wait. Things would likely be
different on separate machines, but I'm not sure. I haven't found the
exact cause, nor a way to avoid it.

*1: The reason for using that function is that I noticed at first that
the queries get much faster just by prefixing them with "explain
analyze".

Async Append (theoretically) no longer affects the non-async path at
all, so B is expected to show no degradation; the result seems to be
within error. C and F show the gain when all foreign tables share one
connection, and D and G show the gain when every foreign table has a
dedicated connection.

(previous numbers)
>                            patched(ms)  unpatched(ms)  gain(%)
> A: simple table scan     :     3562.32        3444.81     -3.4
> B: local partitioning    :     1451.25        1604.38      9.5
> C: single remote table   :     8818.92        9297.76      5.1
> D: sharding (single con) :     5966.14        6646.73     10.2
> E: sharding (multi con)  :     1802.25        6515.49     72.3

fetch_size = 100
                           patched(ms)  unpatched(ms)  gain(%)
A: simple table scan     :     3065.82        3046.82    -0.62
B: local partitioning    :     1393.98        1378.00    -1.16
C: single remote table   :     8499.73        8595.66     1.12
D: sharding (single con) :     9267.85        9251.59    -0.18
E: sharding (multi con)  :     2567.02        9295.22    72.38
F: partition (single con):     9241.08        9060.19    -2.00
G: partition (multi con) :     2548.86        9419.18    72.94

fetch_size = 200
                           patched(ms)  unpatched(ms)  gain(%)
A: simple table scan     :     3067.08        2999.23     -2.3
B: local partitioning    :     1392.07        1384.49     -0.5
C: single remote table   :     8521.72        8505.48     -0.2
D: sharding (single con) :     6752.81        7076.02      4.6
E: sharding (multi con)  :     1958.2         7188.02     72.8
F: partition (single con):     6756.72        7000.72      3.5
G: partition (multi con) :     1969.8         7228.85     72.8

fetch_size = 1000
                           patched(ms)  unpatched(ms)  gain(%)
A: simple table scan     :     4547.44        4519.34    -0.62
B: local partitioning    :     2880.66        2739.43    -5.16
C: single remote table   :     8448.04        8572.15     1.45
D: sharding (single con) :     2405.01        5919.31    59.37
E: sharding (multi con)  :     1872.15        5963.04    68.60
F: partition (single con):     2369.08        5960.81    60.26
G: partition (multi con) :     1854.69        5893.65    68.53

regards.
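For reference, the fetch_size being varied in these runs is the ordinary
postgres_fdw option, settable per server or per foreign table. A minimal
sketch of how the 100/200/1000 configurations could be switched follows;
the server and table names here are placeholders, not the actual test
harness:

    CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'localhost', dbname 'shard1', fetch_size '100');

    -- switch an existing server to another fetch size
    ALTER SERVER shard1 OPTIONS (SET fetch_size '200');

    -- a per-table setting overrides the server-level one
    ALTER FOREIGN TABLE pt_1 OPTIONS (ADD fetch_size '1000');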
-- Kyotaro Horiguchi NTT Open Source Software Center From 54f85c159f3feee5ee2dac6daacc7330ec101ed5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 22 May 2017 12:42:58 +0900 Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner WaitEventSet needs to be released using resource owner for a certain case. This change adds WaitEventSet reowner and allow the creator of a WaitEventSet to specify a resource owner. --- src/backend/libpq/pqcomm.c | 2 +- src/backend/storage/ipc/latch.c | 18 ++++++- src/backend/storage/lmgr/condition_variable.c | 2 +- src/backend/utils/resowner/resowner.c | 67 +++++++++++++++++++++++++++ src/include/storage/latch.h | 4 +- src/include/utils/resowner_private.h | 8 ++++ 6 files changed, 96 insertions(+), 5 deletions(-) diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index a4f6d4deeb..890972b9b8 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -220,7 +220,7 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3); + FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3); AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL); diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index e6706f7fb8..5457899f2d 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -51,6 +51,7 @@ #include "storage/latch.h" #include "storage/pmsignal.h" #include "storage/shmem.h" +#include "utils/resowner_private.h" /* * Select the fd readiness primitive to use. Normally the "most modern" @@ -77,6 +78,8 @@ struct WaitEventSet int nevents; /* number of registered events */ int nevents_space; /* maximum number of events in this set */ + ResourceOwner resowner; /* Resource owner */ + /* * Array, of nevents_space length, storing the definition of events this * set is waiting for. @@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, int ret = 0; int rc; WaitEvent event; - WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3); + WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3); if (wakeEvents & WL_TIMEOUT) Assert(timeout >= 0); @@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch) * WaitEventSetWait(). */ WaitEventSet * -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) { WaitEventSet *set; char *data; Size sz = 0; + if (res) + ResourceOwnerEnlargeWESs(res); + /* * Use MAXALIGN size/alignment to guarantee that later uses of memory are * aligned correctly. E.g. 
epoll_event might need 8 byte alignment on some @@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents) StaticAssertStmt(WSA_INVALID_EVENT == NULL, ""); #endif + /* Register this wait event set if requested */ + set->resowner = res; + if (res) + ResourceOwnerRememberWES(set->resowner, set); + return set; } @@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set) } #endif + if (set->resowner != NULL) + ResourceOwnerForgetWES(set->resowner, set); + pfree(set); } diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c index ef1d5baf01..30edc8e83a 100644 --- a/src/backend/storage/lmgr/condition_variable.c +++ b/src/backend/storage/lmgr/condition_variable.c @@ -69,7 +69,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv) { WaitEventSet *new_event_set; - new_event_set = CreateWaitEventSet(TopMemoryContext, 2); + new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2); AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL); AddWaitEventToSet(new_event_set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c index bce021e100..802b79a660 100644 --- a/src/backend/utils/resowner/resowner.c +++ b/src/backend/utils/resowner/resowner.c @@ -126,6 +126,7 @@ typedef struct ResourceOwnerData ResourceArray filearr; /* open temporary files */ ResourceArray dsmarr; /* dynamic shmem segments */ ResourceArray jitarr; /* JIT contexts */ + ResourceArray wesarr; /* wait event sets */ /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */ int nlocks; /* number of owned locks */ @@ -171,6 +172,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc); static void PrintSnapshotLeakWarning(Snapshot snapshot); static void PrintFileLeakWarning(File file); static void PrintDSMLeakWarning(dsm_segment *seg); +static void PrintWESLeakWarning(WaitEventSet *events); /***************************************************************************** @@ -440,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name) ResourceArrayInit(&(owner->filearr), FileGetDatum(-1)); ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL)); ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL)); + ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL)); return owner; } @@ -549,6 +552,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner, jit_release_context(context); } + + /* Ditto for wait event sets */ + while (ResourceArrayGetAny(&(owner->wesarr), &foundres)) + { + WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres); + + if (isCommit) + PrintWESLeakWarning(event); + FreeWaitEventSet(event); + } } else if (phase == RESOURCE_RELEASE_LOCKS) { @@ -697,6 +710,7 @@ ResourceOwnerDelete(ResourceOwner owner) Assert(owner->filearr.nitems == 0); Assert(owner->dsmarr.nitems == 0); Assert(owner->jitarr.nitems == 0); + Assert(owner->wesarr.nitems == 0); Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1); /* @@ -724,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner) ResourceArrayFree(&(owner->filearr)); ResourceArrayFree(&(owner->dsmarr)); ResourceArrayFree(&(owner->jitarr)); + ResourceArrayFree(&(owner->wesarr)); pfree(owner); } @@ -1301,3 +1316,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle) elog(ERROR, "JIT context %p is not owned by resource owner %s", DatumGetPointer(handle), owner->name); } + +/* + * wait event set reference array. 
+ * + * This is separate from actually inserting an entry because if we run out + * of memory, it's critical to do so *before* acquiring the resource. + */ +void +ResourceOwnerEnlargeWESs(ResourceOwner owner) +{ + ResourceArrayEnlarge(&(owner->wesarr)); +} + +/* + * Remember that a wait event set is owned by a ResourceOwner + * + * Caller must have previously done ResourceOwnerEnlargeWESs() + */ +void +ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events) +{ + ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events)); +} + +/* + * Forget that a wait event set is owned by a ResourceOwner + */ +void +ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events) +{ + /* + * XXXX: There's no property to show as an identier of a wait event set, + * use its pointer instead. + */ + if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events))) + elog(ERROR, "wait event set %p is not owned by resource owner %s", + events, owner->name); +} + +/* + * Debugging subroutine + */ +static void +PrintWESLeakWarning(WaitEventSet *events) +{ + /* + * XXXX: There's no property to show as an identier of a wait event set, + * use its pointer instead. + */ + elog(WARNING, "wait event set leak: %p still referenced", + events); +} diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h index a4bcb48874..838845af01 100644 --- a/src/include/storage/latch.h +++ b/src/include/storage/latch.h @@ -101,6 +101,7 @@ #define LATCH_H #include <signal.h> +#include "utils/resowner.h" /* * Latch structure should be treated as opaque and only accessed through @@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch); extern void SetLatch(volatile Latch *latch); extern void ResetLatch(volatile Latch *latch); -extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents); +extern WaitEventSet *CreateWaitEventSet(MemoryContext context, + ResourceOwner res, int nevents); extern void FreeWaitEventSet(WaitEventSet *set); extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch, void *user_data); diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h index a6e8eb71ab..3c06e4c3f8 100644 --- a/src/include/utils/resowner_private.h +++ b/src/include/utils/resowner_private.h @@ -18,6 +18,7 @@ #include "storage/dsm.h" #include "storage/fd.h" +#include "storage/latch.h" #include "storage/lock.h" #include "utils/catcache.h" #include "utils/plancache.h" @@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner, extern void ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle); +/* support for wait event set management */ +extern void ResourceOwnerEnlargeWESs(ResourceOwner owner); +extern void ResourceOwnerRememberWES(ResourceOwner owner, + WaitEventSet *); +extern void ResourceOwnerForgetWES(ResourceOwner owner, + WaitEventSet *); + #endif /* RESOWNER_PRIVATE_H */ -- 2.16.3 From 19ff6af521070b8245f4bd04bd535a5286be1509 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 15 May 2018 20:21:32 +0900 Subject: [PATCH 2/3] infrastructure for asynchronous execution This patch add an infrastructure for asynchronous execution. As a PoC this makes only Append capable to handle asynchronously executable subnodes. 
--- src/backend/commands/explain.c | 17 ++ src/backend/executor/Makefile | 2 +- src/backend/executor/execAsync.c | 145 ++++++++++++++++ src/backend/executor/nodeAppend.c | 285 ++++++++++++++++++++++++++++---- src/backend/executor/nodeForeignscan.c | 22 ++- src/backend/nodes/bitmapset.c | 72 ++++++++ src/backend/nodes/copyfuncs.c | 2 + src/backend/nodes/outfuncs.c | 2 + src/backend/nodes/readfuncs.c | 2 + src/backend/optimizer/plan/createplan.c | 68 +++++++- src/backend/postmaster/pgstat.c | 3 + src/backend/utils/adt/ruleutils.c | 8 +- src/include/executor/execAsync.h | 23 +++ src/include/executor/executor.h | 1 + src/include/executor/nodeForeignscan.h | 3 + src/include/foreign/fdwapi.h | 11 ++ src/include/nodes/bitmapset.h | 1 + src/include/nodes/execnodes.h | 18 +- src/include/nodes/plannodes.h | 7 + src/include/pgstat.h | 3 +- 20 files changed, 646 insertions(+), 49 deletions(-) create mode 100644 src/backend/executor/execAsync.c create mode 100644 src/include/executor/execAsync.h diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 73d94b7235..09c5327cb4 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -83,6 +83,7 @@ static void show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es); static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors, ExplainState *es); +static void show_append_info(AppendState *astate, ExplainState *es); static void show_agg_keys(AggState *astate, List *ancestors, ExplainState *es); static void show_grouping_sets(PlanState *planstate, Agg *agg, @@ -1168,6 +1169,8 @@ ExplainNode(PlanState *planstate, List *ancestors, } if (plan->parallel_aware) appendStringInfoString(es->str, "Parallel "); + if (plan->async_capable) + appendStringInfoString(es->str, "Async "); appendStringInfoString(es->str, pname); es->indent++; } @@ -1690,6 +1693,11 @@ ExplainNode(PlanState *planstate, List *ancestors, case T_Hash: show_hash_info(castNode(HashState, planstate), es); break; + + case T_Append: + show_append_info(castNode(AppendState, planstate), es); + break; + default: break; } @@ -2027,6 +2035,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors, ancestors, es); } +static void +show_append_info(AppendState *astate, ExplainState *es) +{ + Append *plan = (Append *) astate->ps.plan; + + if (plan->nasyncplans > 0) + ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es); +} + /* * Show the grouping keys for an Agg node. */ diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile index cc09895fa5..8ad2adfe1c 100644 --- a/src/backend/executor/Makefile +++ b/src/backend/executor/Makefile @@ -12,7 +12,7 @@ subdir = src/backend/executor top_builddir = ../../.. include $(top_builddir)/src/Makefile.global -OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \ +OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \ execGrouping.o execIndexing.o execJunk.o \ execMain.o execParallel.o execPartition.o execProcnode.o \ execReplication.o execScan.o execSRF.o execTuples.o \ diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c new file mode 100644 index 0000000000..db477e2cf6 --- /dev/null +++ b/src/backend/executor/execAsync.c @@ -0,0 +1,145 @@ +/*------------------------------------------------------------------------- + * + * execAsync.c + * Support routines for asynchronous execution. 
+ * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/executor/execAsync.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "executor/execAsync.h" +#include "executor/nodeAppend.h" +#include "executor/nodeForeignscan.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "utils/memutils.h" +#include "utils/resowner.h" + +void ExecAsyncSetState(PlanState *pstate, AsyncState status) +{ + pstate->asyncstate = status; +} + +bool +ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node, + void *data, bool reinit) +{ + switch (nodeTag(node)) + { + case T_ForeignScanState: + return ExecForeignAsyncConfigureWait((ForeignScanState *)node, + wes, data, reinit); + break; + default: + elog(ERROR, "unrecognized node type: %d", + (int) nodeTag(node)); + } +} + +/* + * struct for memory context callback argument used in ExecAsyncEventWait + */ +typedef struct { + int **p_refind; + int *p_refindsize; +} ExecAsync_mcbarg; + +/* + * callback function to reset static variables pointing to the memory in + * TopTransactionContext in ExecAsyncEventWait. + */ +static void ExecAsyncMemoryContextCallback(void *arg) +{ + /* arg is the address of the variable refind in ExecAsyncEventWait */ + ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg; + *mcbarg->p_refind = NULL; + *mcbarg->p_refindsize = 0; +} + +#define EVENT_BUFFER_SIZE 16 + +Bitmapset * +ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout) +{ + static int *refind = NULL; + static int refindsize = 0; + WaitEventSet *wes; + WaitEvent occurred_event[EVENT_BUFFER_SIZE]; + int noccurred = 0; + Bitmapset *fired_events = NULL; + int i; + int n; + + n = bms_num_members(waitnodes); + wes = CreateWaitEventSet(TopTransactionContext, + TopTransactionResourceOwner, n); + if (refindsize < n) + { + if (refindsize == 0) + refindsize = EVENT_BUFFER_SIZE; /* XXX */ + while (refindsize < n) + refindsize *= 2; + if (refind) + refind = (int *) repalloc(refind, refindsize * sizeof(int)); + else + { + static ExecAsync_mcbarg mcb_arg = + { &refind, &refindsize }; + static MemoryContextCallback mcb = + { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL }; + MemoryContext oldctxt = + MemoryContextSwitchTo(TopTransactionContext); + + /* + * refind points to a memory block in + * TopTransactionContext. Register a callback to reset it. 
+ */ + MemoryContextRegisterResetCallback(TopTransactionContext, &mcb); + refind = (int *) palloc(refindsize * sizeof(int)); + MemoryContextSwitchTo(oldctxt); + } + } + + n = 0; + for (i = bms_next_member(waitnodes, -1) ; i >= 0 ; + i = bms_next_member(waitnodes, i)) + { + refind[i] = i; + if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true)) + n++; + } + + if (n == 0) + { + FreeWaitEventSet(wes); + return NULL; + } + + noccurred = WaitEventSetWait(wes, timeout, occurred_event, + EVENT_BUFFER_SIZE, + WAIT_EVENT_ASYNC_WAIT); + FreeWaitEventSet(wes); + if (noccurred == 0) + return NULL; + + for (i = 0 ; i < noccurred ; i++) + { + WaitEvent *w = &occurred_event[i]; + + if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0) + { + int n = *(int*)w->user_data; + + fired_events = bms_add_member(fired_events, n); + } + } + + return fired_events; +} diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c index 6bc3e470bf..94fafe72fb 100644 --- a/src/backend/executor/nodeAppend.c +++ b/src/backend/executor/nodeAppend.c @@ -60,6 +60,7 @@ #include "executor/execdebug.h" #include "executor/execPartition.h" #include "executor/nodeAppend.h" +#include "executor/execAsync.h" #include "miscadmin.h" /* Shared state for parallel-aware Append. */ @@ -81,6 +82,7 @@ struct ParallelAppendState #define NO_MATCHING_SUBPLANS -2 static TupleTableSlot *ExecAppend(PlanState *pstate); +static TupleTableSlot *ExecAppendAsync(PlanState *pstate); static bool choose_next_subplan_locally(AppendState *node); static bool choose_next_subplan_for_leader(AppendState *node); static bool choose_next_subplan_for_worker(AppendState *node); @@ -104,13 +106,14 @@ ExecInitAppend(Append *node, EState *estate, int eflags) PlanState **appendplanstates; Bitmapset *validsubplans; int nplans; + int nasyncplans; int firstvalid; int i, j; ListCell *lc; /* check for unsupported flags */ - Assert(!(eflags & EXEC_FLAG_MARK)); + Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC))); /* * Lock the non-leaf tables in the partition tree controlled by this node. 
@@ -123,10 +126,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags) */ appendstate->ps.plan = (Plan *) node; appendstate->ps.state = estate; - appendstate->ps.ExecProcNode = ExecAppend; + + /* choose appropriate version of Exec function */ + if (node->nasyncplans == 0) + appendstate->ps.ExecProcNode = ExecAppend; + else + appendstate->ps.ExecProcNode = ExecAppendAsync; /* Let choose_next_subplan_* function handle setting the first subplan */ - appendstate->as_whichplan = INVALID_SUBPLAN_INDEX; + appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX; /* If run-time partition pruning is enabled, then set that up now */ if (node->part_prune_infos != NIL) @@ -159,7 +167,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags) */ if (bms_is_empty(validsubplans)) { - appendstate->as_whichplan = NO_MATCHING_SUBPLANS; + appendstate->as_whichsyncplan = NO_MATCHING_SUBPLANS; /* Mark the first as valid so that it's initialized below */ validsubplans = bms_make_singleton(0); @@ -213,11 +221,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags) */ j = i = 0; firstvalid = nplans; + nasyncplans = 0; foreach(lc, node->appendplans) { if (bms_is_member(i, validsubplans)) { Plan *initNode = (Plan *) lfirst(lc); + int sub_eflags = eflags; + + /* Let async-capable subplans run asynchronously */ + if (i < node->nasyncplans) + { + sub_eflags |= EXEC_FLAG_ASYNC; + nasyncplans++; + } /* * Record the lowest appendplans index which is a valid partial @@ -226,7 +243,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags) if (i >= node->first_partial_plan && j < firstvalid) firstvalid = j; - appendplanstates[j++] = ExecInitNode(initNode, estate, eflags); + appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags); } i++; } @@ -235,6 +252,21 @@ ExecInitAppend(Append *node, EState *estate, int eflags) appendstate->appendplans = appendplanstates; appendstate->as_nplans = nplans; + /* fill in async stuff */ + appendstate->as_nasyncplans = nasyncplans; + appendstate->as_syncdone = (nasyncplans == nplans); + + if (appendstate->as_nasyncplans) + { + appendstate->as_asyncresult = (TupleTableSlot **) + palloc0(node->nasyncplans * sizeof(TupleTableSlot *)); + + /* initially, all async requests need a request */ + for (i = 0; i < appendstate->as_nasyncplans; ++i) + appendstate->as_needrequest = + bms_add_member(appendstate->as_needrequest, i); + } + /* * Miscellaneous initialization */ @@ -258,21 +290,23 @@ ExecAppend(PlanState *pstate) { AppendState *node = castNode(AppendState, pstate); - if (node->as_whichplan < 0) + if (node->as_whichsyncplan < 0) { /* * If no subplan has been chosen, we must choose one before * proceeding. 
*/ - if (node->as_whichplan == INVALID_SUBPLAN_INDEX && + if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX && !node->choose_next_subplan(node)) return ExecClearTuple(node->ps.ps_ResultTupleSlot); /* Nothing to do if there are no matching subplans */ - else if (node->as_whichplan == NO_MATCHING_SUBPLANS) + else if (node->as_whichsyncplan == NO_MATCHING_SUBPLANS) return ExecClearTuple(node->ps.ps_ResultTupleSlot); } + Assert(node->as_nasyncplans == 0); + for (;;) { PlanState *subnode; @@ -283,8 +317,9 @@ ExecAppend(PlanState *pstate) /* * figure out which subplan we are currently processing */ - Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans); - subnode = node->appendplans[node->as_whichplan]; + Assert(node->as_whichsyncplan >= 0 && + node->as_whichsyncplan < node->as_nplans); + subnode = node->appendplans[node->as_whichsyncplan]; /* * get a tuple from the subplan @@ -307,6 +342,175 @@ ExecAppend(PlanState *pstate) } } +static TupleTableSlot * +ExecAppendAsync(PlanState *pstate) +{ + AppendState *node = castNode(AppendState, pstate); + Bitmapset *needrequest; + int i; + + Assert(node->as_nasyncplans > 0); + +restart: + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + needrequest = node->as_needrequest; + node->as_needrequest = NULL; + while ((i = bms_first_member(needrequest)) >= 0) + { + TupleTableSlot *slot; + PlanState *subnode = node->appendplans[i]; + + slot = ExecProcNode(subnode); + if (subnode->asyncstate == AS_AVAILABLE) + { + if (!TupIsNull(slot)) + { + node->as_asyncresult[node->as_nasyncresult++] = slot; + node->as_needrequest = bms_add_member(node->as_needrequest, i); + } + } + else + node->as_pending_async = bms_add_member(node->as_pending_async, i); + } + bms_free(needrequest); + + for (;;) + { + TupleTableSlot *result; + + /* return now if a result is available */ + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + while (!bms_is_empty(node->as_pending_async)) + { + long timeout = node->as_syncdone ? -1 : 0; + Bitmapset *fired; + int i; + + fired = ExecAsyncEventWait(node->appendplans, + node->as_pending_async, + timeout); + + if (bms_is_empty(fired) && node->as_syncdone) + { + /* + * No subplan fired. This happens when even in normal + * operation where the subnode already prepared results before + * waiting. as_pending_result is storing stale information so + * restart from the beginning. + */ + node->as_needrequest = node->as_pending_async; + node->as_pending_async = NULL; + goto restart; + } + + while ((i = bms_first_member(fired)) >= 0) + { + TupleTableSlot *slot; + PlanState *subnode = node->appendplans[i]; + slot = ExecProcNode(subnode); + if (subnode->asyncstate == AS_AVAILABLE) + { + if (!TupIsNull(slot)) + { + node->as_asyncresult[node->as_nasyncresult++] = slot; + node->as_needrequest = + bms_add_member(node->as_needrequest, i); + } + node->as_pending_async = + bms_del_member(node->as_pending_async, i); + } + } + bms_free(fired); + + /* return now if a result is available */ + if (node->as_nasyncresult > 0) + { + --node->as_nasyncresult; + return node->as_asyncresult[node->as_nasyncresult]; + } + + if (!node->as_syncdone) + break; + } + + /* + * If there is no asynchronous activity still pending and the + * synchronous activity is also complete, we're totally done scanning + * this node. Otherwise, we're done with the asynchronous stuff but + * must continue scanning the synchronous children. 
+ */ + if (node->as_syncdone) + { + Assert(bms_is_empty(node->as_pending_async)); + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + + /* + * get a tuple from the subplan + */ + + if (node->as_whichsyncplan < 0) + { + /* + * If no subplan has been chosen, we must choose one before + * proceeding. + */ + if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX && + !node->choose_next_subplan(node)) + { + node->as_syncdone = true; + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + + /* Nothing to do if there are no matching subplans */ + else if (node->as_whichsyncplan == NO_MATCHING_SUBPLANS) + { + node->as_syncdone = true; + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + } + + result = ExecProcNode(node->appendplans[node->as_whichsyncplan]); + + if (!TupIsNull(result)) + { + /* + * If the subplan gave us something then return it as-is. We do + * NOT make use of the result slot that was set up in + * ExecInitAppend; there's no need for it. + */ + return result; + } + + /* + * Go on to the "next" subplan. If no more subplans, return the empty + * slot set up for us by ExecInitAppend, unless there are async plans + * we have yet to finish. + */ + if (!node->choose_next_subplan(node)) + { + node->as_syncdone = true; + if (bms_is_empty(node->as_pending_async)) + { + Assert(bms_is_empty(node->as_needrequest)); + return ExecClearTuple(node->ps.ps_ResultTupleSlot); + } + } + + /* Else loop back and try to get a tuple from the new subplan */ + } +} + /* ---------------------------------------------------------------- * ExecEndAppend * @@ -353,6 +557,15 @@ ExecReScanAppend(AppendState *node) node->as_valid_subplans = NULL; } + /* Reset async state. */ + for (i = 0; i < node->as_nasyncplans; ++i) + { + ExecShutdownNode(node->appendplans[i]); + node->as_needrequest = bms_add_member(node->as_needrequest, i); + } + node->as_nasyncresult = 0; + node->as_syncdone = (node->as_nasyncplans == node->as_nplans); + for (i = 0; i < node->as_nplans; i++) { PlanState *subnode = node->appendplans[i]; @@ -373,7 +586,7 @@ ExecReScanAppend(AppendState *node) } /* Let choose_next_subplan_* function handle setting the first subplan */ - node->as_whichplan = INVALID_SUBPLAN_INDEX; + node->as_whichsyncplan = INVALID_SUBPLAN_INDEX; } /* ---------------------------------------------------------------- @@ -461,7 +674,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt) static bool choose_next_subplan_locally(AppendState *node) { - int whichplan = node->as_whichplan; + int whichplan = node->as_whichsyncplan; int nextplan; /* We should never be called when there are no subplans */ @@ -480,6 +693,10 @@ choose_next_subplan_locally(AppendState *node) node->as_valid_subplans = ExecFindMatchingSubPlans(node->as_prune_state); + /* Exclude async plans */ + if (node->as_nasyncplans > 0) + bms_del_range(node->as_valid_subplans, 0, node->as_nasyncplans - 1); + whichplan = -1; } @@ -494,7 +711,7 @@ choose_next_subplan_locally(AppendState *node) if (nextplan < 0) return false; - node->as_whichplan = nextplan; + node->as_whichsyncplan = nextplan; return true; } @@ -516,19 +733,19 @@ choose_next_subplan_for_leader(AppendState *node) Assert(ScanDirectionIsForward(node->ps.state->es_direction)); /* We should never be called when there are no subplans */ - Assert(node->as_whichplan != NO_MATCHING_SUBPLANS); + Assert(node->as_whichsyncplan != NO_MATCHING_SUBPLANS); LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE); - if (node->as_whichplan != INVALID_SUBPLAN_INDEX) + if (node->as_whichsyncplan != 
INVALID_SUBPLAN_INDEX) { /* Mark just-completed subplan as finished. */ - node->as_pstate->pa_finished[node->as_whichplan] = true; + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; } else { /* Start with last subplan. */ - node->as_whichplan = node->as_nplans - 1; + node->as_whichsyncplan = node->as_nplans - 1; /* * If we've yet to determine the valid subplans for these parameters @@ -549,12 +766,12 @@ choose_next_subplan_for_leader(AppendState *node) } /* Loop until we find a subplan to execute. */ - while (pstate->pa_finished[node->as_whichplan]) + while (pstate->pa_finished[node->as_whichsyncplan]) { - if (node->as_whichplan == 0) + if (node->as_whichsyncplan == 0) { pstate->pa_next_plan = INVALID_SUBPLAN_INDEX; - node->as_whichplan = INVALID_SUBPLAN_INDEX; + node->as_whichsyncplan = INVALID_SUBPLAN_INDEX; LWLockRelease(&pstate->pa_lock); return false; } @@ -563,12 +780,12 @@ choose_next_subplan_for_leader(AppendState *node) * We needn't pay attention to as_valid_subplans here as all invalid * plans have been marked as finished. */ - node->as_whichplan--; + node->as_whichsyncplan--; } /* If non-partial, immediately mark as finished. */ - if (node->as_whichplan < node->as_first_partial_plan) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan < node->as_first_partial_plan) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; LWLockRelease(&pstate->pa_lock); @@ -597,13 +814,13 @@ choose_next_subplan_for_worker(AppendState *node) Assert(ScanDirectionIsForward(node->ps.state->es_direction)); /* We should never be called when there are no subplans */ - Assert(node->as_whichplan != NO_MATCHING_SUBPLANS); + Assert(node->as_whichsyncplan != NO_MATCHING_SUBPLANS); LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE); /* Mark just-completed subplan as finished. */ - if (node->as_whichplan != INVALID_SUBPLAN_INDEX) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; /* * If we've yet to determine the valid subplans for these parameters then @@ -625,7 +842,7 @@ choose_next_subplan_for_worker(AppendState *node) } /* Save the plan from which we are starting the search. */ - node->as_whichplan = pstate->pa_next_plan; + node->as_whichsyncplan = pstate->pa_next_plan; /* Loop until we find a valid subplan to execute. */ while (pstate->pa_finished[pstate->pa_next_plan]) @@ -639,7 +856,7 @@ choose_next_subplan_for_worker(AppendState *node) /* Advance to the next valid plan. */ pstate->pa_next_plan = nextplan; } - else if (node->as_whichplan > node->as_first_partial_plan) + else if (node->as_whichsyncplan > node->as_first_partial_plan) { /* * Try looping back to the first valid partial plan, if there is @@ -648,7 +865,7 @@ choose_next_subplan_for_worker(AppendState *node) nextplan = bms_next_member(node->as_valid_subplans, node->as_first_partial_plan - 1); pstate->pa_next_plan = - nextplan < 0 ? node->as_whichplan : nextplan; + nextplan < 0 ? node->as_whichsyncplan : nextplan; } else { @@ -656,10 +873,10 @@ choose_next_subplan_for_worker(AppendState *node) * At last plan, and either there are no partial plans or we've * tried them all. Arrange to bail out. */ - pstate->pa_next_plan = node->as_whichplan; + pstate->pa_next_plan = node->as_whichsyncplan; } - if (pstate->pa_next_plan == node->as_whichplan) + if (pstate->pa_next_plan == node->as_whichsyncplan) { /* We've tried everything! 
*/ pstate->pa_next_plan = INVALID_SUBPLAN_INDEX; @@ -669,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node) } /* Pick the plan we found, and advance pa_next_plan one more time. */ - node->as_whichplan = pstate->pa_next_plan; + node->as_whichsyncplan = pstate->pa_next_plan; pstate->pa_next_plan = bms_next_member(node->as_valid_subplans, pstate->pa_next_plan); @@ -696,8 +913,8 @@ choose_next_subplan_for_worker(AppendState *node) } /* If non-partial, immediately mark as finished. */ - if (node->as_whichplan < node->as_first_partial_plan) - node->as_pstate->pa_finished[node->as_whichplan] = true; + if (node->as_whichsyncplan < node->as_first_partial_plan) + node->as_pstate->pa_finished[node->as_whichsyncplan] = true; LWLockRelease(&pstate->pa_lock); diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c index a2a28b7ec2..915deb7080 100644 --- a/src/backend/executor/nodeForeignscan.c +++ b/src/backend/executor/nodeForeignscan.c @@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate) (ExecScanRecheckMtd) ForeignRecheck); } - /* ---------------------------------------------------------------- * ExecInitForeignScan * ---------------------------------------------------------------- @@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags) scanstate->ss.ps.plan = (Plan *) node; scanstate->ss.ps.state = estate; scanstate->ss.ps.ExecProcNode = ExecForeignScan; + scanstate->ss.ps.asyncstate = AS_AVAILABLE; + + if ((eflags & EXEC_FLAG_ASYNC) != 0) + scanstate->fs_async = true; /* * Miscellaneous initialization @@ -387,3 +390,20 @@ ExecShutdownForeignScan(ForeignScanState *node) if (fdwroutine->ShutdownForeignScan) fdwroutine->ShutdownForeignScan(node); } + +/* ---------------------------------------------------------------- + * ExecAsyncForeignScanConfigureWait + * + * In async mode, configure for a wait + * ---------------------------------------------------------------- + */ +bool +ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes, + void *caller_data, bool reinit) +{ + FdwRoutine *fdwroutine = node->fdwroutine; + + Assert(fdwroutine->ForeignAsyncConfigureWait != NULL); + return fdwroutine->ForeignAsyncConfigureWait(node, wes, + caller_data, reinit); +} diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index 9bf9a29d6b..b2ab879d49 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -922,6 +922,78 @@ bms_add_range(Bitmapset *a, int lower, int upper) return a; } +/* + * bms_del_range + * Delete members in the range of 'lower' to 'upper' from the set. + * + * Note this could also be done by calling bms_del_member in a loop, however, + * using this function will be faster when the range is large as we work at + * the bitmapword level rather than at bit level. 
+ */ +Bitmapset * +bms_del_range(Bitmapset *a, int lower, int upper) +{ + int lwordnum, + lbitnum, + uwordnum, + ushiftbits, + wordnum; + + if (lower < 0 || upper < 0) + elog(ERROR, "negative bitmapset member not allowed"); + if (lower > upper) + elog(ERROR, "lower range must not be above upper range"); + uwordnum = WORDNUM(upper); + + if (a == NULL) + { + a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1)); + a->nwords = uwordnum + 1; + } + + /* ensure we have enough words to store the upper bit */ + else if (uwordnum >= a->nwords) + { + int oldnwords = a->nwords; + int i; + + a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1)); + a->nwords = uwordnum + 1; + /* zero out the enlarged portion */ + for (i = oldnwords; i < a->nwords; i++) + a->words[i] = 0; + } + + wordnum = lwordnum = WORDNUM(lower); + + lbitnum = BITNUM(lower); + ushiftbits = BITNUM(upper) + 1; + + /* + * Special case when lwordnum is the same as uwordnum we must perform the + * upper and lower masking on the word. + */ + if (lwordnum == uwordnum) + { + a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1) + | (~(bitmapword) 0) << ushiftbits); + } + else + { + /* turn off lbitnum and all bits left of it */ + a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1); + + /* turn off all bits for any intermediate words */ + while (wordnum < uwordnum) + a->words[wordnum++] = (bitmapword) 0; + + /* turn off upper's bit and all bits right of it. */ + a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits; + } + + return a; +} + /* * bms_int_members - like bms_intersect, but left input is recycled */ diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index 7c045a7afe..8304dd5b17 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -246,6 +246,8 @@ _copyAppend(const Append *from) COPY_NODE_FIELD(appendplans); COPY_SCALAR_FIELD(first_partial_plan); COPY_NODE_FIELD(part_prune_infos); + COPY_SCALAR_FIELD(nasyncplans); + COPY_SCALAR_FIELD(referent); return newnode; } diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index 1da9d7ed15..ed655f4ccb 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -403,6 +403,8 @@ _outAppend(StringInfo str, const Append *node) WRITE_NODE_FIELD(appendplans); WRITE_INT_FIELD(first_partial_plan); WRITE_NODE_FIELD(part_prune_infos); + WRITE_INT_FIELD(nasyncplans); + WRITE_INT_FIELD(referent); } static void diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c index 2826cec2f8..fb4ae251de 100644 --- a/src/backend/nodes/readfuncs.c +++ b/src/backend/nodes/readfuncs.c @@ -1652,6 +1652,8 @@ _readAppend(void) READ_NODE_FIELD(appendplans); READ_INT_FIELD(first_partial_plan); READ_NODE_FIELD(part_prune_infos); + READ_INT_FIELD(nasyncplans); + READ_INT_FIELD(referent); READ_DONE(); } diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index 0317763f43..eda3420d02 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -211,7 +211,9 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual, Index scanrelid, int wtParam); static Append *make_append(List *appendplans, int first_partial_plan, - List *tlist, List *partitioned_rels, List *partpruneinfos); + int nasyncplans, int referent, + List *tlist, + List *partitioned_rels, List *partpruneinfos); static RecursiveUnion 
*make_recursive_union(List *tlist, Plan *lefttree, Plan *righttree, @@ -294,6 +296,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root, List *rowMarks, OnConflictExpr *onconflict, int epqParam); static GatherMerge *create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path); +static bool is_async_capable_path(Path *path); /* @@ -1036,10 +1039,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path) { Append *plan; List *tlist = build_path_tlist(root, &best_path->path); - List *subplans = NIL; + List *asyncplans = NIL; + List *syncplans = NIL; ListCell *subpaths; RelOptInfo *rel = best_path->path.parent; List *partpruneinfos = NIL; + int nasyncplans = 0; + bool first = true; + bool referent_is_sync = true; /* * The subpaths list could be empty, if every child was proven empty by @@ -1074,7 +1081,22 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path) /* Must insist that all children return the same tlist */ subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST); - subplans = lappend(subplans, subplan); + /* + * Classify as async-capable or not. If we have decided to run the + * children in parallel, we cannot run any of them asynchronously. + */ + if (!best_path->path.parallel_safe && is_async_capable_path(subpath)) + { + subplan->async_capable = true; + asyncplans = lappend(asyncplans, subplan); + ++nasyncplans; + if (first) + referent_is_sync = false; + } + else + syncplans = lappend(syncplans, subplan); + + first = false; } if (enable_partition_pruning && @@ -1117,9 +1139,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path) * parent-rel Vars it'll be asked to emit. */ - plan = make_append(subplans, best_path->first_partial_path, - tlist, best_path->partitioned_rels, - partpruneinfos); + plan = make_append(list_concat(asyncplans, syncplans), + best_path->first_partial_path, nasyncplans, + referent_is_sync ? nasyncplans : 0, tlist, + best_path->partitioned_rels, partpruneinfos); copy_generic_path_info(&plan->plan, (Path *) best_path); @@ -5414,9 +5437,9 @@ make_foreignscan(List *qptlist, } static Append * -make_append(List *appendplans, int first_partial_plan, - List *tlist, List *partitioned_rels, - List *partpruneinfos) +make_append(List *appendplans, int first_partial_plan, int nasyncplans, + int referent, List *tlist, + List *partitioned_rels, List *partpruneinfos) { Append *node = makeNode(Append); Plan *plan = &node->plan; @@ -5429,6 +5452,9 @@ make_append(List *appendplans, int first_partial_plan, node->appendplans = appendplans; node->first_partial_plan = first_partial_plan; node->part_prune_infos = partpruneinfos; + node->nasyncplans = nasyncplans; + node->referent = referent; + return node; } @@ -6773,3 +6799,27 @@ is_projection_capable_plan(Plan *plan) } return true; } + +/* + * is_async_capable_path + * Check whether a given Path node is async-capable.
+ */ +static bool +is_async_capable_path(Path *path) +{ + switch (nodeTag(path)) + { + case T_ForeignPath: + { + FdwRoutine *fdwroutine = path->parent->fdwroutine; + + Assert(fdwroutine != NULL); + if (fdwroutine->IsForeignPathAsyncCapable != NULL && + fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path)) + return true; + } + default: + break; + } + return false; +} diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 084573e77c..7aef97ca97 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -3683,6 +3683,9 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_SYNC_REP: event_name = "SyncRep"; break; + case WAIT_EVENT_ASYNC_WAIT: + event_name = "AsyncExecWait"; + break; /* no default case, so that compiler will warn */ } diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index 065238b0fe..fe202cbfea 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -4513,7 +4513,7 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps) dpns->planstate = ps; /* - * We special-case Append and MergeAppend to pretend that the first child + * We special-case Append and MergeAppend to pretend that a specific child * plan is the OUTER referent; we have to interpret OUTER Vars in their * tlists according to one of the children, and the first one is the most * natural choice. Likewise special-case ModifyTable to pretend that the @@ -4521,7 +4521,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps) * lists containing references to non-target relations. */ if (IsA(ps, AppendState)) - dpns->outer_planstate = ((AppendState *) ps)->appendplans[0]; + { + AppendState *aps = (AppendState *) ps; + Append *app = (Append *) ps->plan; + dpns->outer_planstate = aps->appendplans[app->referent]; + } else if (IsA(ps, MergeAppendState)) dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0]; else if (IsA(ps, ModifyTableState)) diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h new file mode 100644 index 0000000000..5fd67d9004 --- /dev/null +++ b/src/include/executor/execAsync.h @@ -0,0 +1,23 @@ +/*-------------------------------------------------------------------- + * execAsync.h + * Support functions for asynchronous query execution + * + * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/include/executor/execAsync.h + *-------------------------------------------------------------------- + */ +#ifndef EXECASYNC_H +#define EXECASYNC_H + +#include "nodes/execnodes.h" +#include "storage/latch.h" + +extern void ExecAsyncSetState(PlanState *pstate, AsyncState status); +extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node, + void *data, bool reinit); +extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, + long timeout); +#endif /* EXECASYNC_H */ diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index a7ea3c7d10..8e9d87669f 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -63,6 +63,7 @@ #define EXEC_FLAG_WITH_OIDS 0x0020 /* force OIDs in returned tuples */ #define EXEC_FLAG_WITHOUT_OIDS 0x0040 /* force no OIDs in returned tuples */ #define EXEC_FLAG_WITH_NO_DATA 0x0080 /* rel scannability doesn't matter */ +#define EXEC_FLAG_ASYNC 0x0100 /* request async execution */ /* Hook for plugins to get control
in ExecutorStart() */ diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h index ccb66be733..67abf8e52e 100644 --- a/src/include/executor/nodeForeignscan.h +++ b/src/include/executor/nodeForeignscan.h @@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node, extern void ExecForeignScanInitializeWorker(ForeignScanState *node, ParallelWorkerContext *pwcxt); extern void ExecShutdownForeignScan(ForeignScanState *node); +extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, bool reinit); #endif /* NODEFOREIGNSCAN_H */ diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h index c14eb546c6..c00e9621fb 100644 --- a/src/include/foreign/fdwapi.h +++ b/src/include/foreign/fdwapi.h @@ -168,6 +168,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root, typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root, List *fdw_private, RelOptInfo *child_rel); +typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path); +typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, + bool reinit); /* * FdwRoutine is the struct returned by a foreign-data wrapper's handler @@ -189,6 +194,7 @@ typedef struct FdwRoutine GetForeignPlan_function GetForeignPlan; BeginForeignScan_function BeginForeignScan; IterateForeignScan_function IterateForeignScan; + IterateForeignScan_function IterateForeignScanAsync; ReScanForeignScan_function ReScanForeignScan; EndForeignScan_function EndForeignScan; @@ -241,6 +247,11 @@ typedef struct FdwRoutine InitializeDSMForeignScan_function InitializeDSMForeignScan; ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan; InitializeWorkerForeignScan_function InitializeWorkerForeignScan; + + /* Support functions for asynchronous execution */ + IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable; + ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait; + ShutdownForeignScan_function ShutdownForeignScan; /* Support functions for path reparameterization. */ diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h index b6f1a9e6e5..41f0927934 100644 --- a/src/include/nodes/bitmapset.h +++ b/src/include/nodes/bitmapset.h @@ -94,6 +94,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b); extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper); extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b); extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b); +extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper); extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b); /* support for iterating through the integer elements of a set: */ diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index da7f52cab0..56bfe3f442 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -905,6 +905,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate); * abstract superclass for all PlanState-type nodes. * ---------------- */ +typedef enum AsyncState +{ + AS_AVAILABLE, + AS_WAITING +} AsyncState; + typedef struct PlanState { NodeTag type; @@ -953,6 +959,9 @@ typedef struct PlanState * descriptor, without encoding knowledge about all executor nodes. 
*/ TupleDesc scandesc; + + AsyncState asyncstate; + int32 padding; /* to keep alignment of derived types */ } PlanState; /* ---------------- @@ -1087,14 +1096,20 @@ struct AppendState PlanState ps; /* its first field is NodeTag */ PlanState **appendplans; /* array of PlanStates for my inputs */ int as_nplans; - int as_whichplan; + int as_whichsyncplan; /* which sync plan is being executed */ int as_first_partial_plan; /* Index of 'appendplans' containing * the first partial plan */ + int as_nasyncplans; /* # of async-capable children */ ParallelAppendState *as_pstate; /* parallel coordination info */ Size pstate_len; /* size of parallel coordination info */ struct PartitionPruneState *as_prune_state; Bitmapset *as_valid_subplans; bool (*choose_next_subplan) (AppendState *); + bool as_syncdone; /* all synchronous plans done? */ + Bitmapset *as_needrequest; /* async plans needing a new request */ + Bitmapset *as_pending_async; /* pending async plans */ + TupleTableSlot **as_asyncresult; /* unreturned results of async plans */ + int as_nasyncresult; /* # of valid entries in as_asyncresult */ }; /* ---------------- @@ -1643,6 +1658,7 @@ typedef struct ForeignScanState Size pscan_len; /* size of parallel coordination information */ /* use struct pointer to avoid including fdwapi.h here */ struct FdwRoutine *fdwroutine; + bool fs_async; void *fdw_state; /* foreign-data wrapper can keep state here */ } ForeignScanState; diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h index f2dda82e66..8a64c037c9 100644 --- a/src/include/nodes/plannodes.h +++ b/src/include/nodes/plannodes.h @@ -139,6 +139,11 @@ typedef struct Plan bool parallel_aware; /* engage parallel-aware logic? */ bool parallel_safe; /* OK to use as part of parallel plan? */ + /* + * information needed for asynchronous execution + */ + bool async_capable; /* engage asynchronous execution logic? */ + /* * Common structural data for all Plan types.
*/ @@ -262,6 +267,8 @@ typedef struct Append * Mapping details for run-time subplan pruning, one per partitioned_rels */ List *part_prune_infos; + int nasyncplans; /* # of async plans, always at start of list */ + int referent; /* index of inheritance tree referent */ } Append; /* ---------------- diff --git a/src/include/pgstat.h b/src/include/pgstat.h index be2f59239b..6f4583b46c 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -832,7 +832,8 @@ typedef enum WAIT_EVENT_REPLICATION_ORIGIN_DROP, WAIT_EVENT_REPLICATION_SLOT_DROP, WAIT_EVENT_SAFE_SNAPSHOT, - WAIT_EVENT_SYNC_REP + WAIT_EVENT_SYNC_REP, + WAIT_EVENT_ASYNC_WAIT } WaitEventIPC; /* ---------- -- 2.16.3 From 72120b5c2b0775d33186dec7d4fc206e63094c20 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 19 Oct 2017 17:24:07 +0900 Subject: [PATCH 3/3] async postgres_fdw --- contrib/postgres_fdw/connection.c | 26 + contrib/postgres_fdw/expected/postgres_fdw.out | 198 ++++---- contrib/postgres_fdw/postgres_fdw.c | 633 ++++++++++++++++++++++--- contrib/postgres_fdw/postgres_fdw.h | 2 + contrib/postgres_fdw/sql/postgres_fdw.sql | 20 +- 5 files changed, 708 insertions(+), 171 deletions(-) diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index fe4893a8e0..da7c826e4f 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -58,6 +58,7 @@ typedef struct ConnCacheEntry bool invalidated; /* true if reconnect is pending */ uint32 server_hashvalue; /* hash value of foreign server OID */ uint32 mapping_hashvalue; /* hash value of user mapping OID */ + void *storage; /* connection specific storage */ } ConnCacheEntry; /* @@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt) elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)", entry->conn, server->servername, user->umid, user->userid); + entry->storage = NULL; } /* @@ -215,6 +217,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt) return entry->conn; } +/* + * Returns the connection-specific storage for this user. Allocate it with + * initsize if it doesn't exist yet. + */ +void * +GetConnectionSpecificStorage(UserMapping *user, size_t initsize) +{ + bool found; + ConnCacheEntry *entry; + ConnCacheKey key; + + key = user->umid; + entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found); + Assert(found); + + if (entry->storage == NULL) + { + entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize); + memset(entry->storage, 0, initsize); + } + + return entry->storage; +} + /* * Connect to remote server using specified server and user mapping properties.
*/ diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index bb6b1a8fdf..cddc207c04 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -6793,7 +6793,7 @@ INSERT INTO a(aa) VALUES('aaaaa'); INSERT INTO b(aa) VALUES('bbb'); INSERT INTO b(aa) VALUES('bbbb'); INSERT INTO b(aa) VALUES('bbbbb'); -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+------- a | aaa @@ -6821,7 +6821,7 @@ SELECT tableoid::regclass, * FROM ONLY a; (3 rows) UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | aaa @@ -6849,7 +6849,7 @@ SELECT tableoid::regclass, * FROM ONLY a; (3 rows) UPDATE b SET aa = 'new'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | aaa @@ -6877,7 +6877,7 @@ SELECT tableoid::regclass, * FROM ONLY a; (3 rows) UPDATE a SET aa = 'newtoo'; -SELECT tableoid::regclass, * FROM a; +SELECT tableoid::regclass, * FROM a ORDER BY 1, 2; tableoid | aa ----------+-------- a | newtoo @@ -6947,35 +6947,41 @@ insert into bar2 values(3,33,33); insert into bar2 values(4,44,44); insert into bar2 values(7,77,77); explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for update; - QUERY PLAN ----------------------------------------------------------------------------------------------- +select * from bar where f1 in (select f1 from foo) order by 1 for update; + QUERY PLAN +----------------------------------------------------------------------------------------------------------------- LockRows Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid - -> Hash Join + -> Merge Join Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid Inner Unique: true - Hash Cond: (bar.f1 = foo.f1) - -> Append - -> Seq Scan on public.bar + Merge Cond: (bar.f1 = foo.f1) + -> Merge Append + Sort Key: bar.f1 + -> Sort Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid + Sort Key: bar.f1 + -> Seq Scan on public.bar + Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid -> Foreign Scan on public.bar2 Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid - Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE - -> Hash + Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE + -> Sort Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Sort Key: foo.f1 -> HashAggregate Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 - -> Foreign Scan on public.foo2 + Async subplans: 1 + -> Async Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -(23 rows) + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 +(29 rows) -select * from bar where f1 in (select f1 from foo) for update; +select * from bar where f1 in (select f1 from foo) order by 1 for update; f1 | f2 ----+---- 1 | 11 @@ -6985,35 +6991,41 @@ select * from bar where f1 in (select f1 from foo) for update; (4 rows) explain (verbose, costs off) -select * from bar where f1 in (select f1 from foo) for share; - QUERY PLAN 
----------------------------------------------------------------------------------------------- +select * from bar where f1 in (select f1 from foo) order by 1 for share; + QUERY PLAN +---------------------------------------------------------------------------------------------------------------- LockRows Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid - -> Hash Join + -> Merge Join Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid Inner Unique: true - Hash Cond: (bar.f1 = foo.f1) - -> Append - -> Seq Scan on public.bar + Merge Cond: (bar.f1 = foo.f1) + -> Merge Append + Sort Key: bar.f1 + -> Sort Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid + Sort Key: bar.f1 + -> Seq Scan on public.bar + Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid -> Foreign Scan on public.bar2 Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid - Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE - -> Hash + Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE + -> Sort Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Sort Key: foo.f1 -> HashAggregate Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 - -> Foreign Scan on public.foo2 + Async subplans: 1 + -> Async Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -(23 rows) + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 +(29 rows) -select * from bar where f1 in (select f1 from foo) for share; +select * from bar where f1 in (select f1 from foo) order by 1 for share; f1 | f2 ----+---- 1 | 11 @@ -7043,11 +7055,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 - -> Foreign Scan on public.foo2 + Async subplans: 1 + -> Async Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Hash Join Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid Inner Unique: true @@ -7061,12 +7074,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 - -> Foreign Scan on public.foo2 + Async subplans: 1 + -> Async Foreign Scan on public.foo2 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -(39 rows) + -> Seq Scan on public.foo + Output: foo.ctid, foo.*, foo.tableoid, foo.f1 +(41 rows) update bar set f2 = f2 + 100 where f1 in (select f1 from foo); select tableoid::regclass, * from bar order by 1,2; @@ -7096,16 +7110,17 @@ where bar.f1 = ss.f1; Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1)) Hash Cond: (foo.f1 = bar.f1) -> Append - -> Seq Scan on public.foo - Output: ROW(foo.f1), foo.f1 - -> Foreign Scan on public.foo2 + Async subplans: 2 + -> Async Foreign Scan on public.foo2 Output: ROW(foo2.f1), foo2.f1 Remote SQL: SELECT f1 FROM public.loct1 - -> Seq Scan on public.foo foo_1 - Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) - -> Foreign Scan on public.foo2 foo2_1 + -> Async 
Foreign Scan on public.foo2 foo2_1 Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) Remote SQL: SELECT f1 FROM public.loct1 + -> Seq Scan on public.foo + Output: ROW(foo.f1), foo.f1 + -> Seq Scan on public.foo foo_1 + Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) -> Hash Output: bar.f1, bar.f2, bar.ctid -> Seq Scan on public.bar @@ -7123,17 +7138,18 @@ where bar.f1 = ss.f1; Output: (ROW(foo.f1)), foo.f1 Sort Key: foo.f1 -> Append - -> Seq Scan on public.foo - Output: ROW(foo.f1), foo.f1 - -> Foreign Scan on public.foo2 + Async subplans: 2 + -> Async Foreign Scan on public.foo2 Output: ROW(foo2.f1), foo2.f1 Remote SQL: SELECT f1 FROM public.loct1 - -> Seq Scan on public.foo foo_1 - Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) - -> Foreign Scan on public.foo2 foo2_1 + -> Async Foreign Scan on public.foo2 foo2_1 Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3) Remote SQL: SELECT f1 FROM public.loct1 -(45 rows) + -> Seq Scan on public.foo + Output: ROW(foo.f1), foo.f1 + -> Seq Scan on public.foo foo_1 + Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3) +(47 rows) update bar set f2 = f2 + 100 from @@ -7283,27 +7299,33 @@ delete from foo where f1 < 5 returning *; (5 rows) explain (verbose, costs off) -update bar set f2 = f2 + 100 returning *; - QUERY PLAN ------------------------------------------------------------------------------- - Update on public.bar - Output: bar.f1, bar.f2 - Update on public.bar - Foreign Update on public.bar2 - -> Seq Scan on public.bar - Output: bar.f1, (bar.f2 + 100), bar.ctid - -> Foreign Update on public.bar2 - Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2 -(8 rows) +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; + QUERY PLAN +-------------------------------------------------------------------------------------- + Sort + Output: u.f1, u.f2 + Sort Key: u.f1 + CTE u + -> Update on public.bar + Output: bar.f1, bar.f2 + Update on public.bar + Foreign Update on public.bar2 + -> Seq Scan on public.bar + Output: bar.f1, (bar.f2 + 100), bar.ctid + -> Foreign Update on public.bar2 + Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2 + -> CTE Scan on u + Output: u.f1, u.f2 +(14 rows) -update bar set f2 = f2 + 100 returning *; +with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1; f1 | f2 ----+----- 1 | 311 2 | 322 - 6 | 266 3 | 333 4 | 344 + 6 | 266 7 | 277 (6 rows) @@ -8155,11 +8177,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J Sort Sort Key: t1.a, t3.c -> Append - -> Foreign Scan + Async subplans: 2 + -> Async Foreign Scan Relations: ((public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)) INNER JOIN (public.ftprt1_p1 t3) - -> Foreign Scan + -> Async Foreign Scan Relations: ((public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)) INNER JOIN (public.ftprt1_p2 t3) -(7 rows) +(8 rows) SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE t1.a% 25 =0 ORDER BY 1,2,3; a | b | c @@ -8178,9 +8201,10 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) Sort Sort Key: t1.a, ftprt2_p1.b, ftprt2_p1.c -> Append - -> Foreign Scan + Async subplans: 1 + -> Async Foreign Scan Relations: (public.ftprt1_p1 t1) LEFT JOIN (public.ftprt2_p1 fprt2) -(5 rows) +(6 rows) SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) t2 ON (t1.a = t2.b and t1.b = t2.a) WHEREt1.a < 10 ORDER BY 1,2,3; a | b | c @@ -8200,11 +8224,12 @@ SELECT t1,t2 FROM fprt1 t1 JOIN 
fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE Sort Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2) -> Append - -> Foreign Scan + Async subplans: 2 + -> Async Foreign Scan Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2) - -> Foreign Scan + -> Async Foreign Scan Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2) -(7 rows) +(8 rows) SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a % 25 =0 ORDER BY 1,2; t1 | t2 @@ -8223,11 +8248,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t Sort Sort Key: t1.a, t1.b -> Append - -> Foreign Scan + Async subplans: 2 + -> Async Foreign Scan Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2) - -> Foreign Scan + -> Async Foreign Scan Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2) -(7 rows) +(8 rows) SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25= 0 ORDER BY 1,2; a | b @@ -8309,10 +8335,11 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O Group Key: fpagg_tab_p1.a Filter: (avg(fpagg_tab_p1.b) < '22'::numeric) -> Append - -> Foreign Scan on fpagg_tab_p1 - -> Foreign Scan on fpagg_tab_p2 - -> Foreign Scan on fpagg_tab_p3 -(9 rows) + Async subplans: 3 + -> Async Foreign Scan on fpagg_tab_p1 + -> Async Foreign Scan on fpagg_tab_p2 + -> Async Foreign Scan on fpagg_tab_p3 +(10 rows) -- Plan with partitionwise aggregates is enabled SET enable_partitionwise_aggregate TO true; @@ -8323,13 +8350,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O Sort Sort Key: fpagg_tab_p1.a -> Append - -> Foreign Scan + Async subplans: 3 + -> Async Foreign Scan Relations: Aggregate on (public.fpagg_tab_p1 pagg_tab) - -> Foreign Scan + -> Async Foreign Scan Relations: Aggregate on (public.fpagg_tab_p2 pagg_tab) - -> Foreign Scan + -> Async Foreign Scan Relations: Aggregate on (public.fpagg_tab_p3 pagg_tab) -(9 rows) +(10 rows) SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1; a | sum | min | count diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 78b0f43ca8..51d19cc421 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -20,6 +20,8 @@ #include "commands/defrem.h" #include "commands/explain.h" #include "commands/vacuum.h" +#include "executor/execAsync.h" +#include "executor/nodeForeignscan.h" #include "foreign/fdwapi.h" #include "funcapi.h" #include "miscadmin.h" @@ -34,6 +36,7 @@ #include "optimizer/var.h" #include "optimizer/tlist.h" #include "parser/parsetree.h" +#include "pgstat.h" #include "utils/builtins.h" #include "utils/guc.h" #include "utils/lsyscache.h" @@ -53,6 +56,9 @@ PG_MODULE_MAGIC; /* If no remote estimates, assume a sort costs 20% extra */ #define DEFAULT_FDW_SORT_MULTIPLIER 1.2 +/* Retrieve PgFdwScanState struct from ForeignScanState */ +#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state) + /* * Indexes of FDW-private information stored in fdw_private lists. * @@ -119,11 +125,28 @@ enum FdwDirectModifyPrivateIndex FdwDirectModifyPrivateSetProcessed }; +/* + * Connection private area structure.
+ */ +typedef struct PgFdwConnpriv +{ + ForeignScanState *leader; /* leader node of this connection */ + bool busy; /* true if this connection is busy */ +} PgFdwConnpriv; + +/* Execution state base type */ +typedef struct PgFdwState +{ + PGconn *conn; /* connection for the scan */ + PgFdwConnpriv *connpriv; /* connection private memory */ +} PgFdwState; + /* * Execution state of a foreign scan using postgres_fdw. */ typedef struct PgFdwScanState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table. NULL * for a foreign join scan. */ TupleDesc tupdesc; /* tuple descriptor of scan */ @@ -134,7 +157,7 @@ typedef struct PgFdwScanState List *retrieved_attrs; /* list of retrieved attribute numbers */ /* for remote query execution */ - PGconn *conn; /* connection for the scan */ + bool result_ready; unsigned int cursor_number; /* quasi-unique ID for my cursor */ bool cursor_exists; /* have we created the cursor? */ int numParams; /* number of parameters passed to query */ @@ -150,6 +173,12 @@ typedef struct PgFdwScanState /* batch-level state, for optimizing rewinds and avoiding useless fetch */ int fetch_ct_2; /* Min(# of fetches done, 2) */ bool eof_reached; /* true if last fetch reached EOF */ + bool run_async; /* true if run asynchronously */ + bool inqueue; /* true if this node is in waiter queue */ + ForeignScanState *waiter; /* Next node to run a query among nodes + * sharing the same connection */ + ForeignScanState *last_waiter; /* last waiting node in waiting queue. + * valid only on the leader node */ /* working memory contexts */ MemoryContext batch_cxt; /* context holding current batch of tuples */ @@ -163,11 +192,11 @@ typedef struct PgFdwScanState */ typedef struct PgFdwModifyState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table */ AttInMetadata *attinmeta; /* attribute datatype conversion metadata */ /* for remote query execution */ - PGconn *conn; /* connection for the scan */ char *p_name; /* name of prepared statement, if created */ /* extracted fdw_private data */ @@ -190,6 +219,7 @@ typedef struct PgFdwModifyState */ typedef struct PgFdwDirectModifyState { + PgFdwState s; /* common structure */ Relation rel; /* relcache entry for the foreign table */ AttInMetadata *attinmeta; /* attribute datatype conversion metadata */ @@ -293,6 +323,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags); static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node); static void postgresReScanForeignScan(ForeignScanState *node); static void postgresEndForeignScan(ForeignScanState *node); +static void postgresShutdownForeignScan(ForeignScanState *node); static void postgresAddForeignUpdateTargets(Query *parsetree, RangeTblEntry *target_rte, Relation target_relation); @@ -358,6 +389,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root, RelOptInfo *input_rel, RelOptInfo *output_rel, void *extra); +static bool postgresIsForeignPathAsyncCapable(ForeignPath *path); +static bool postgresForeignAsyncConfigureWait(ForeignScanState *node, + WaitEventSet *wes, + void *caller_data, bool reinit); /* * Helper functions @@ -378,7 +413,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel, EquivalenceClass *ec, EquivalenceMember *em, void *arg); static void create_cursor(ForeignScanState *node); -static void fetch_more_data(ForeignScanState *node); +static void request_more_data(ForeignScanState *node); +static void 
fetch_received_data(ForeignScanState *node); +static void vacate_connection(PgFdwState *fdwconn, bool clear_queue); static void close_cursor(PGconn *conn, unsigned int cursor_number); static PgFdwModifyState *create_foreign_modify(EState *estate, RangeTblEntry *rte, @@ -469,6 +506,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS) routine->IterateForeignScan = postgresIterateForeignScan; routine->ReScanForeignScan = postgresReScanForeignScan; routine->EndForeignScan = postgresEndForeignScan; + routine->ShutdownForeignScan = postgresShutdownForeignScan; /* Functions for updating foreign tables */ routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets; @@ -505,6 +543,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS) /* Support functions for upper relation push-down */ routine->GetForeignUpperPaths = postgresGetForeignUpperPaths; + /* Support functions for async execution */ + routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable; + routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait; + PG_RETURN_POINTER(routine); } @@ -1355,12 +1397,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags) * Get connection to the foreign server. Connection manager will * establish new connection if necessary. */ - fsstate->conn = GetConnection(user, false); + fsstate->s.conn = GetConnection(user, false); + fsstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); + fsstate->s.connpriv->leader = NULL; + fsstate->s.connpriv->busy = false; + fsstate->waiter = NULL; + fsstate->last_waiter = node; /* Assign a unique ID for my cursor */ - fsstate->cursor_number = GetCursorNumber(fsstate->conn); + fsstate->cursor_number = GetCursorNumber(fsstate->s.conn); fsstate->cursor_exists = false; + /* Initialize async execution status */ + fsstate->run_async = false; + fsstate->inqueue = false; + /* Get private info created by planner functions. */ fsstate->query = strVal(list_nth(fsplan->fdw_private, FdwScanPrivateSelectSql)); @@ -1408,40 +1460,258 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags) &fsstate->param_values); } +/* + * Async queue manipulation functions + */ + +/* + * add_async_waiter: + * + * Adds the node to the end of the waiter queue. Immediately starts the node + * if no node is running. + */ +static inline void +add_async_waiter(ForeignScanState *node) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); + ForeignScanState *leader = fsstate->s.connpriv->leader; + + /* do nothing if the node is already in the queue or already eof'ed */ + if (leader == node || fsstate->inqueue || fsstate->eof_reached) + return; + + if (leader == NULL) + { + /* immediately send request if not busy */ + request_more_data(node); + } + else + { + PgFdwScanState *leader_state = GetPgFdwScanState(leader); + PgFdwScanState *last_waiter_state + = GetPgFdwScanState(leader_state->last_waiter); + + last_waiter_state->waiter = node; + leader_state->last_waiter = node; + fsstate->inqueue = true; + } +} + +/* + * move_to_next_waiter: + * + * Makes the first waiter the next leader. + * Returns the new leader, or NULL if there's no waiter.
+ */ +static inline ForeignScanState * +move_to_next_waiter(ForeignScanState *node) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); + ForeignScanState *ret = fsstate->waiter; + + Assert(fsstate->s.connpriv->leader == node); + + if (ret) + { + PgFdwScanState *retstate = GetPgFdwScanState(ret); + fsstate->waiter = NULL; + retstate->last_waiter = fsstate->last_waiter; + retstate->inqueue = false; + } + + fsstate->s.connpriv->leader = ret; + + return ret; +} + +/* + * Remove the node from the waiter queue + * + * This is a bit different from the two above in that it can operate on the + * connection leader. The result is absorbed when this is called on the + * active leader. + * + * Returns true if the node was found. + */ +static inline bool +remove_async_node(ForeignScanState *node) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); + ForeignScanState *leader = fsstate->s.connpriv->leader; + PgFdwScanState *leader_state; + ForeignScanState *prev; + PgFdwScanState *prev_state; + ForeignScanState *cur; + + /* no need to remove me */ + if (!leader || !fsstate->inqueue) + return false; + + leader_state = GetPgFdwScanState(leader); + + /* Remove the leader node */ + if (leader == node) + { + ForeignScanState *next_leader; + + if (leader_state->s.connpriv->busy) + { + /* + * This node is waiting for a result; absorb the result first so + * that the following commands can be sent on the connection. + */ + PgFdwScanState *leader_state = GetPgFdwScanState(leader); + PGconn *conn = leader_state->s.conn; + + while(PQisBusy(conn)) + PQclear(PQgetResult(conn)); + + leader_state->s.connpriv->busy = false; + } + + /* Make the first waiter the leader */ + if (leader_state->waiter) + { + PgFdwScanState *next_leader_state; + + next_leader = leader_state->waiter; + next_leader_state = GetPgFdwScanState(next_leader); + + leader_state->s.connpriv->leader = next_leader; + next_leader_state->last_waiter = leader_state->last_waiter; + } + leader_state->waiter = NULL; + + return true; + } + + /* + * Just remove the node from the queue + * + * This function is called on the shutdown path. We don't bother + * considering a faster way to do this. + */ + prev = leader; + prev_state = leader_state; + cur = GetPgFdwScanState(prev)->waiter; + while (cur) + { + PgFdwScanState *curstate = GetPgFdwScanState(cur); + + if (cur == node) + { + prev_state->waiter = curstate->waiter; + if (leader_state->last_waiter == cur) + leader_state->last_waiter = prev; + + fsstate->inqueue = false; + + return true; + } + prev = cur; + prev_state = curstate; + cur = curstate->waiter; + } + + return false; +} + /* * postgresIterateForeignScan - * Retrieve next row from the result set, or clear tuple slot to indicate - * EOF. + * Retrieve next row from the result set. + * + * For synchronous nodes, returns a clear tuple slot to indicate EOF. + * + * If the node is an asynchronous one, a clear tuple slot has two meanings. + * If the caller receives a clear tuple slot, asyncstate indicates whether + * the node is at EOF (AS_AVAILABLE) or waiting for data to + * come (AS_WAITING). */ static TupleTableSlot * postgresIterateForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); TupleTableSlot *slot = node->ss.ss_ScanTupleSlot; - /* - * If this is the first call after Begin or ReScan, we need to create the - * cursor on the remote side.
- */ - if (!fsstate->cursor_exists) - create_cursor(node); + if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached) + { + /* we've run out, get some more tuples */ + if (!node->fs_async) + { + /* finish running query to send my command */ + if (!fsstate->s.connpriv->busy) + vacate_connection((PgFdwState *)fsstate, false); + + request_more_data(node); + + /* + * Fetch the result immediately. This executes the next waiter if + * any. + */ + fetch_received_data(node); + } + else if (!fsstate->s.connpriv->busy) + { + /* If the connection is not busy, just send the request. */ + request_more_data(node); + } + else + { + /* This connection is busy */ + bool available = true; + ForeignScanState *leader = fsstate->s.connpriv->leader; + PgFdwScanState *leader_state = GetPgFdwScanState(leader); + + /* Check if the result is immediately available */ + if (PQisBusy(leader_state->s.conn)) + { + int rc = WaitLatchOrSocket(NULL, + WL_SOCKET_READABLE | WL_TIMEOUT, + PQsocket(leader_state->s.conn), 0, + WAIT_EVENT_ASYNC_WAIT); + if (!(rc & WL_SOCKET_READABLE)) + available = false; + } + + /* The next waiter is executed automatically */ + if (available) + fetch_received_data(leader); + + /* add the requested node */ + add_async_waiter(node); + + /* add the previous leader */ + add_async_waiter(leader); + } + } /* - * Get some more tuples, if we've run out. + * If we haven't received a result for the given node this time, + * return with no tuple to give way to another node. */ if (fsstate->next_tuple >= fsstate->num_tuples) { - /* No point in another fetch if we already detected EOF, though. */ - if (!fsstate->eof_reached) - fetch_more_data(node); - /* If we didn't get any tuples, must be end of data. */ - if (fsstate->next_tuple >= fsstate->num_tuples) - return ExecClearTuple(slot); + if (fsstate->eof_reached) + { + fsstate->result_ready = true; + node->ss.ps.asyncstate = AS_AVAILABLE; + } + else + { + fsstate->result_ready = false; + node->ss.ps.asyncstate = AS_WAITING; + } + + return ExecClearTuple(slot); } /* * Return the next tuple. */ + fsstate->result_ready = true; + node->ss.ps.asyncstate = AS_AVAILABLE; ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++], slot, InvalidBuffer, @@ -1457,7 +1727,7 @@ postgresIterateForeignScan(ForeignScanState *node) static void postgresReScanForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); char sql[64]; PGresult *res; @@ -1465,6 +1735,8 @@ postgresReScanForeignScan(ForeignScanState *node) if (!fsstate->cursor_exists) return; + vacate_connection((PgFdwState *)fsstate, true); + /* * If any internal parameters affecting this node have changed, we'd * better destroy and recreate the cursor. Otherwise, rewinding it should @@ -1493,9 +1765,9 @@ postgresReScanForeignScan(ForeignScanState *node) * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_exec_query(fsstate->conn, sql); + res = pgfdw_exec_query(fsstate->s.conn, sql); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fsstate->conn, true, sql); + pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql); PQclear(res); /* Now force a fresh FETCH.
*/ @@ -1513,7 +1785,7 @@ postgresReScanForeignScan(ForeignScanState *node) static void postgresEndForeignScan(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); /* if fsstate is NULL, we are in EXPLAIN; nothing to do */ if (fsstate == NULL) @@ -1521,15 +1793,31 @@ postgresEndForeignScan(ForeignScanState *node) /* Close the cursor if open, to prevent accumulation of cursors */ if (fsstate->cursor_exists) - close_cursor(fsstate->conn, fsstate->cursor_number); + close_cursor(fsstate->s.conn, fsstate->cursor_number); /* Release remote connection */ - ReleaseConnection(fsstate->conn); - fsstate->conn = NULL; + ReleaseConnection(fsstate->s.conn); + fsstate->s.conn = NULL; /* MemoryContexts will be deleted automatically. */ } +/* + * postgresShutdownForeignScan + * Remove asynchrony stuff and cleanup garbage on the connection. + */ +static void +postgresShutdownForeignScan(ForeignScanState *node) +{ + ForeignScan *plan = (ForeignScan *) node->ss.ps.plan; + + if (plan->operation != CMD_SELECT) + return; + + /* remove the node from waiting queue */ + remove_async_node(node); +} + /* * postgresAddForeignUpdateTargets * Add resjunk column(s) needed for update/delete on a foreign table @@ -1753,6 +2041,9 @@ postgresExecForeignInsert(EState *estate, PGresult *res; int n_rows; + /* finish running query to send my command */ + vacate_connection((PgFdwState *)fmstate, true); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1763,14 +2054,14 @@ postgresExecForeignInsert(EState *estate, /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -1778,10 +2069,10 @@ postgresExecForeignInsert(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -1819,6 +2110,9 @@ postgresExecForeignUpdate(EState *estate, PGresult *res; int n_rows; + /* finish running query to send my command */ + vacate_connection((PgFdwState *)fmstate, true); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1839,14 +2133,14 @@ postgresExecForeignUpdate(EState *estate, /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. 
@@ -1854,10 +2148,10 @@ postgresExecForeignUpdate(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -1895,6 +2189,9 @@ postgresExecForeignDelete(EState *estate, PGresult *res; int n_rows; + /* finish running query to send my command */ + vacate_connection((PgFdwState *)fmstate, true); + /* Set up the prepared statement on the remote server, if we didn't yet */ if (!fmstate->p_name) prepare_foreign_modify(fmstate); @@ -1915,14 +2212,14 @@ postgresExecForeignDelete(EState *estate, /* * Execute the prepared statement. */ - if (!PQsendQueryPrepared(fmstate->conn, + if (!PQsendQueryPrepared(fmstate->s.conn, fmstate->p_name, fmstate->p_nums, p_values, NULL, NULL, 0)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -1930,10 +2227,10 @@ postgresExecForeignDelete(EState *estate, * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != (fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK)) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) @@ -2400,7 +2697,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags) * Get connection to the foreign server. Connection manager will * establish new connection if necessary. */ - dmstate->conn = GetConnection(user, false); + dmstate->s.conn = GetConnection(user, false); + dmstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); /* Update the foreign-join-related fields. */ if (fsplan->scan.scanrelid == 0) @@ -2485,7 +2784,11 @@ postgresIterateDirectModify(ForeignScanState *node) * If this is the first call after Begin, execute the statement. */ if (dmstate->num_tuples == -1) + { + /* finish running query to send my command */ + vacate_connection((PgFdwState *)dmstate, true); execute_dml_stmt(node); + } /* * If the local query doesn't specify RETURNING, just clear tuple slot. @@ -2532,8 +2835,8 @@ postgresEndDirectModify(ForeignScanState *node) PQclear(dmstate->result); /* Release remote connection */ - ReleaseConnection(dmstate->conn); - dmstate->conn = NULL; + ReleaseConnection(dmstate->s.conn); + dmstate->s.conn = NULL; /* close the target relation. 
*/ if (dmstate->resultRel) @@ -2656,6 +2959,7 @@ estimate_path_cost_size(PlannerInfo *root, List *local_param_join_conds; StringInfoData sql; PGconn *conn; + PgFdwConnpriv *connpriv; Selectivity local_sel; QualCost local_cost; List *fdw_scan_tlist = NIL; @@ -2698,6 +3002,18 @@ estimate_path_cost_size(PlannerInfo *root, /* Get the remote estimate */ conn = GetConnection(fpinfo->user, false); + connpriv = GetConnectionSpecificStorage(fpinfo->user, + sizeof(PgFdwConnpriv)); + if (connpriv) + { + PgFdwState tmpstate; + tmpstate.conn = conn; + tmpstate.connpriv = connpriv; + + /* finish running query to send my command */ + vacate_connection(&tmpstate, true); + } + get_remote_estimate(sql.data, conn, &rows, &width, &startup_cost, &total_cost); ReleaseConnection(conn); @@ -3061,11 +3377,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel, static void create_cursor(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); ExprContext *econtext = node->ss.ps.ps_ExprContext; int numParams = fsstate->numParams; const char **values = fsstate->param_values; - PGconn *conn = fsstate->conn; + PGconn *conn = fsstate->s.conn; StringInfoData buf; PGresult *res; @@ -3128,50 +3444,127 @@ create_cursor(ForeignScanState *node) } /* - * Fetch some more rows from the node's cursor. + * Sends the next request of the node. If the given node is different from the + * current connection leader, pushes it back to waiter queue and let the given + * node be the leader. */ static void -fetch_more_data(ForeignScanState *node) +request_more_data(ForeignScanState *node) { - PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state; + PgFdwScanState *fsstate = GetPgFdwScanState(node); + ForeignScanState *leader = fsstate->s.connpriv->leader; + PGconn *conn = fsstate->s.conn; + char sql[64]; + + /* must be non-busy */ + Assert(!fsstate->s.connpriv->busy); + /* must be not-eof */ + Assert(!fsstate->eof_reached); + + /* + * If this is the first call after Begin or ReScan, we need to create the + * cursor on the remote side. + */ + if (!fsstate->cursor_exists) + create_cursor(node); + + snprintf(sql, sizeof(sql), "FETCH %d FROM c%u", + fsstate->fetch_size, fsstate->cursor_number); + + if (!PQsendQuery(conn, sql)) + pgfdw_report_error(ERROR, NULL, conn, false, sql); + + fsstate->s.connpriv->busy = true; + + /* Let the node be the leader if it is different from current one */ + if (leader != node) + { + /* + * If the connection leader exists, insert the node as the connection + * leader making the current leader be the first waiter. + */ + if (leader != NULL) + { + remove_async_node(node); + fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter; + fsstate->waiter = leader; + } + else + { + fsstate->last_waiter = node; + fsstate->waiter = NULL; + } + + fsstate->s.connpriv->leader = node; + } +} + +/* + * Fetches received data and automatically send requests of the next waiter. + */ +static void +fetch_received_data(ForeignScanState *node) +{ + PgFdwScanState *fsstate = GetPgFdwScanState(node); PGresult *volatile res = NULL; MemoryContext oldcontext; + ForeignScanState *waiter; + + /* I should be the current connection leader */ + Assert(fsstate->s.connpriv->leader == node); /* * We'll store the tuples in the batch_cxt. First, flush the previous - * batch. 
+ * batch if no tuple is remaining */ - fsstate->tuples = NULL; - MemoryContextReset(fsstate->batch_cxt); + if (fsstate->next_tuple >= fsstate->num_tuples) + { + fsstate->tuples = NULL; + fsstate->num_tuples = 0; + MemoryContextReset(fsstate->batch_cxt); + } + else if (fsstate->next_tuple > 0) + { + /* move the remaining tuples to the beginning of the store */ + int n = 0; + + while(fsstate->next_tuple < fsstate->num_tuples) + fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++]; + fsstate->num_tuples = n; + } + oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt); /* PGresult must be released before leaving this function. */ PG_TRY(); { - PGconn *conn = fsstate->conn; + PGconn *conn = fsstate->s.conn; char sql[64]; - int numrows; + int addrows; + size_t newsize; int i; snprintf(sql, sizeof(sql), "FETCH %d FROM c%u", fsstate->fetch_size, fsstate->cursor_number); - res = pgfdw_exec_query(conn, sql); + res = pgfdw_get_result(conn, sql); /* On error, report the original query, not the FETCH. */ if (PQresultStatus(res) != PGRES_TUPLES_OK) pgfdw_report_error(ERROR, res, conn, false, fsstate->query); /* Convert the data into HeapTuples */ - numrows = PQntuples(res); - fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple)); - fsstate->num_tuples = numrows; - fsstate->next_tuple = 0; + addrows = PQntuples(res); + newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple); + if (fsstate->tuples) + fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize); + else + fsstate->tuples = (HeapTuple *) palloc(newsize); - for (i = 0; i < numrows; i++) + for (i = 0; i < addrows; i++) { Assert(IsA(node->ss.ps.plan, ForeignScan)); - fsstate->tuples[i] = + fsstate->tuples[fsstate->num_tuples + i] = make_tuple_from_result_row(res, i, fsstate->rel, fsstate->attinmeta, @@ -3181,26 +3574,76 @@ fetch_more_data(ForeignScanState *node) } /* Update fetch_ct_2 */ - if (fsstate->fetch_ct_2 < 2) + if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0) fsstate->fetch_ct_2++; + fsstate->next_tuple = 0; + fsstate->num_tuples += addrows; + /* Must be EOF if we didn't get as many tuples as we asked for. 
*/ - fsstate->eof_reached = (numrows < fsstate->fetch_size); + fsstate->eof_reached = (addrows < fsstate->fetch_size); PQclear(res); res = NULL; } PG_CATCH(); { + fsstate->s.connpriv->busy = false; + if (res) PQclear(res); PG_RE_THROW(); } PG_END_TRY(); + fsstate->s.connpriv->busy = false; + + /* let the first waiter be the next leader of this connection */ + waiter = move_to_next_waiter(node); + + /* send the next request if any */ + if (waiter) + request_more_data(waiter); + MemoryContextSwitchTo(oldcontext); } +/* + * Vacate a connection so that this node can send the next query + */ +static void +vacate_connection(PgFdwState *fdwstate, bool clear_queue) +{ + PgFdwConnpriv *connpriv = fdwstate->connpriv; + ForeignScanState *leader; + + /* the connection is alrady available */ + if (connpriv == NULL || connpriv->leader == NULL || !connpriv->busy) + return; + + /* + * let the current connection leader read the result for the running query + */ + leader = connpriv->leader; + fetch_received_data(leader); + + /* let the first waiter be the next leader of this connection */ + move_to_next_waiter(leader); + + if (!clear_queue) + return; + + /* Clear the waiting list */ + while (leader) + { + PgFdwScanState *fsstate = GetPgFdwScanState(leader); + + fsstate->last_waiter = NULL; + leader = fsstate->waiter; + fsstate->waiter = NULL; + } +} + /* * Force assorted GUC parameters to settings that ensure that we'll output * data values in a form that is unambiguous to the remote server. @@ -3314,7 +3757,9 @@ create_foreign_modify(EState *estate, user = GetUserMapping(userid, table->serverid); /* Open connection; report that we'll create a prepared statement. */ - fmstate->conn = GetConnection(user, true); + fmstate->s.conn = GetConnection(user, true); + fmstate->s.connpriv = (PgFdwConnpriv *) + GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv)); fmstate->p_name = NULL; /* prepared statement not made yet */ /* Set up remote query information. */ @@ -3387,7 +3832,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) /* Construct name we'll use for the prepared statement. */ snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u", - GetPrepStmtNumber(fmstate->conn)); + GetPrepStmtNumber(fmstate->s.conn)); p_name = pstrdup(prep_name); /* @@ -3397,12 +3842,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) * the prepared statements we use in this module are simple enough that * the remote server will make the right choices. */ - if (!PQsendPrepare(fmstate->conn, + if (!PQsendPrepare(fmstate->s.conn, p_name, fmstate->query, 0, NULL)) - pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query); + pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query); /* * Get the result, and check for success. @@ -3410,9 +3855,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate) * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. */ - res = pgfdw_get_result(fmstate->conn, fmstate->query); + res = pgfdw_get_result(fmstate->s.conn, fmstate->query); if (PQresultStatus(res) != PGRES_COMMAND_OK) - pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query); + pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query); PQclear(res); /* This action shows that the prepare has been done. */ @@ -3537,16 +3982,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate) * We don't use a PG_TRY block here, so be careful not to throw error * without releasing the PGresult. 
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */

 	/* Set up remote query information. */
@@ -3387,7 +3832,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)

 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);

 	/*
@@ -3397,12 +3842,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);

 	/*
 	 * Get the result, and check for success.
@@ -3410,9 +3855,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);

 	/* This action shows that the prepare has been done. */
@@ -3537,16 +3982,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}

 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }

 /*
@@ -3706,9 +4151,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);

 	/*
 	 * Get the result, and check for success.
@@ -3716,10 +4161,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);

 	/* Get the number of rows affected. */
@@ -5203,6 +5648,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root,

 	/* XXX Consider parameterized paths for the join relation */
 }

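+/*
+ * XXX For now this claims all postgres_fdw paths async-capable, though
+ * the direct-modify ones actually are not.
+ */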
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when the node is the connection leader.  Otherwise
+ * another node on this connection is the leader.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+								  void *caller_data, bool reinit)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+	/* If the caller didn't reinit, this event is already in the event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connpriv->leader == node)
+	{
+		AddWaitEventToSet(wes,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, caller_data);
+		return true;
+	}
+
+	return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index a5d4011e8d..f344fb7f66 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */

 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching */

 	/*
 	 * Name of the relation while EXPLAINing ForeignScan.  It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);

 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 231b1e01a5..8ecc903c20 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1617,25 +1617,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');

-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;

 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';

-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;

 UPDATE b SET aa = 'new';

-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;

 UPDATE a SET aa = 'newtoo';

-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;

@@ -1677,12 +1677,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);

 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;

 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;

 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1741,8 +1741,8 @@ explain (verbose, costs off) delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;

 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;

 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.16.3
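
To make the connection-sharing protocol above easier to review, here is a
minimal stand-alone sketch of the leader/waiter queue it implements. ScanNode
and Conn are simplified stand-ins for the patch's PgFdwScanState and
PgFdwConnpriv (no libpq, no error handling), so this illustrates the idea
rather than reproducing the patch's actual code:

#include <stdio.h>

typedef struct ScanNode
{
	const char	   *name;
	struct ScanNode *waiter;		/* next node waiting for the connection */
	struct ScanNode *last_waiter;	/* tail of the queue; valid on the leader */
} ScanNode;

typedef struct Conn
{
	ScanNode   *leader;			/* node whose query is currently in flight */
	int			busy;			/* true while a query is running */
} Conn;

/* Ask for data: become leader if the connection is idle, else enqueue. */
static void
request_more_data(Conn *conn, ScanNode *node)
{
	if (!conn->busy)
	{
		conn->leader = node;
		conn->busy = 1;
		node->last_waiter = node;
		printf("%s: query sent\n", node->name);
	}
	else
	{
		ScanNode   *leader = conn->leader;

		/* O(1) append via the leader's last_waiter link */
		leader->last_waiter->waiter = node;
		leader->last_waiter = node;
		printf("%s: queued behind %s\n", node->name, leader->name);
	}
}

/*
 * The leader's result arrived: promote the first waiter and send its
 * request (mimics fetch_received_data() plus move_to_next_waiter()).
 */
static void
result_received(Conn *conn)
{
	ScanNode   *leader = conn->leader;
	ScanNode   *next = leader->waiter;

	printf("%s: result consumed\n", leader->name);
	leader->waiter = NULL;
	conn->leader = NULL;
	conn->busy = 0;

	if (next)
	{
		/* the new leader inherits the rest of the queue */
		next->last_waiter =
			(leader->last_waiter == next) ? next : leader->last_waiter;
		leader->last_waiter = NULL;
		conn->leader = next;
		conn->busy = 1;
		printf("%s: promoted to leader, query sent\n", next->name);
	}
}

int
main(void)
{
	Conn		conn = {0};
	ScanNode	a = {"scan_a"}, b = {"scan_b"}, c = {"scan_c"};

	request_more_data(&conn, &a);	/* a leads */
	request_more_data(&conn, &b);	/* b waits */
	request_more_data(&conn, &c);	/* c waits behind b */
	result_received(&conn);			/* b becomes leader */
	result_received(&conn);			/* c becomes leader */
	result_received(&conn);			/* connection idle again */
	return 0;
}

The intrusive last_waiter link is what keeps enqueueing constant-time while
storing the whole queue in the scan nodes themselves, with no separate
allocation per connection.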