postgres_fdw & async queries - Mailing list pgsql-hackers
From | Stephen Frost |
---|---|
Subject | postgres_fdw & async queries |
Date | |
Msg-id | 20131104032604.GB2706@tamriel.snowman.net Whole thread Raw |
List | pgsql-hackers |
Greetings, I was poking around the Append node and considering my earlier suggestion to build an Async method of pulling data out ofnodes under Append which support that option. It looks workable, but I was also considering simply changing postgres_fdwto use async queries instead- which are started at ExecInit time instead. A few more specifics about what I've been thinking: #1 - Add an Async mechanism to the executor - ExecAsyncCapable - Check if the node supports async. - ExecAsyncStart - Begin an async request from the node - ExecAsyncGetFD- Return FD to be used with select() - ExecAsyncConsume - Consume data from FD - ExecAsyncBusy - Checkif getResult would block - ExecAsyncScan - Get next tuple - All of the above added to the FDW API also.. This is clearly modeled off of the libpq async system and can likely be simplified, but it feels like we'd want at leastthings like GetFD/Consume/Busy, to allow us to build a select() inside Append which would wait until data is availablesomewhere and would then work through all of the nodes (dealing with ones which are done and don't return anything)until it's able to return a row back up. As an aside- I've always wondered if we should have an explicit bulk mechanism instead of one-row-at-a-time and perhaps we build that into this API. I was also thinking we would havean identifier of some kind provided through the API which would indicate which I/O channels are shared and we'd thenset up Append with a two-level list of paths to walk, where it would issue one request against each distinct I/Ochannel and simply move on to the next entry for the same I/O channel when the prior one completes. #2 - Make postgres_fdw build/send an async query during ExecInitNode. Currently, we wait to set up the remote cursor and fetch records until we've actually been asked for a record- but thenwe go and try to get 100 of them. It would seem like we might be able to simply call create_cursor() at the bottomof postgresBeginForeignScan and follow that up with a PQsendQuery, postgresIterateForeignScan wouldn't really changeexcept for not being asked to also create the cursor. Sure, we'd end up blocking if there isn't data available for this node yet, but at least we'd get the query started during Init across all the remote servers, which would certainlybe a massive improvment and we wouldn't need to modify Append or the FDW API at all. Practically speaking,we'd parallelize the initial request of 100 tuples and then scan through the results in order, meaning we'donly pull from one machine at a time if the result set is larger than that initial 100 tuples per system. That said,if most of the work on the remote systems is getting the query started, it would still be a win. This goes against the current documention, which explicitly states that the actual scan should not be started untilIterateForeignScan, but it's not clear, to me at least, why that's strictly necessary. One other thought going in favor of #1 is that we could make other nodes, such as SeqScan, support the new API which wouldallow us to parallelize across, say, mutliple tablespaces (on the presumption that they are independent I/O systemsunderneath, which may or may not be true, of course; perhaps we could explicitly ask the user for the I/O channelrelationship during tablespace creation). With either suggestion, we would need to ensure that postgres_fdw worksthrough the entire Async query and stores the results before trying to do another Async query, if we're going to keepthe one-connection model, as we can't have two Async queries running at the same time. I don't see that as a terribleissue though- we're already fetching/cacheing up to 100 rows of data for the foreign table anyway, and we'd stillhave the actual cursor. If we get a request for a different cursor while we have an Async still open, we'd just finishout the current Async request of 100-or-fewer tuples, cache them, and then send a new request for the new relation. This is all speculation at this point as I've not gotten down into the details of trying to implement any of it, so pleasefeel free to point out any big holes. :) Thoughts? Thanks, Stephen
pgsql-hackers by date: