Thread: Parallel Sort
It would be great if one client session could take advantage of multiple CPU cores. EnterpriseDB wishes to start the trek into this problem space for 9.4 by implementing parallel internal (i.e. not spilling to disk) sort. This touches on a notable subset of the infrastructure components we'll need for parallel general query. My intent is to map out the key design topics, hear about critical topics I hadn't considered, and solicit feedback on the quality of the high-level plan. Full designs for key pieces will come later.

* Worker Backends

A multi-process model, rather than a multi-thread one, is best for introducing parallelism to a PostgreSQL session. With threading, all in-memory state would become shared by default; with processes, in-memory data structures remain unshared until we explicitly select them for sharing. Some data structures will require new sharing, but I expect unshared ones to remain the typical case. I refer to these additional processes as worker backends. They will have much in common with ordinary client backends, including a database connection and a PGPROC. They will have no client socket; instead, they will take direction from the client-connected backend, termed the master, via shared memory.

Each worker needs to make SnapshotNow visibility decisions coherent with the master. For sorting, this allows us to look up comparison functions, even when the current transaction created or modified those functions. This will also be an essential building block for any parallelism project that consults user tables. Implementing this means copying the subtransaction stack and the combocid hash to each worker. For the sake of completeness, we should also copy the global MVCC snapshot data (sorting probably won't care).
Making the workers' visibility coherent with the master also means forbidding, while a parallel task is in flight, operations that affect the transaction state:

- CommandCounterIncrement()
- GetCurrentCommandId(true)
- AssignTransactionId()
- subtransaction-management functions

I don't intend to commit us to a design fundamentally antagonistic to, for example, starting a subtransaction in a worker backend. That being said, I think we could achieve a lot in the parallel query space before we would deem it important to relax those restrictions.

Workers will copy all GUCs from the master. bttextcmp() needs lc_collate, and we'll inevitably reach the point of needing broad GUC coverage as we make more things run in parallel. Further GUC changes will be forbidden while a parallel task is in flight. (More generally, anything that mutates backend-local state without reference to shared state will either need to be forbidden or converted to route through shared state.)

The heavyweight locking mechanism will need to be aware of the association between the master and its workers. For sorting, workers will acquire heavyweight locks on system catalogs only. It is almost enough for the workers to behave as though they are independent of the master for locking purposes. But an undetected deadlock would arise if the master holds an AccessExclusiveLock on a system catalog a worker needs.

We can allocate a small amount of permanent shared memory for coordination among a group of processes, but sorting will benefit from a region as large as maintenance_work_mem. Expect on-demand memory sharing.

* Identifying Parallel-Compatible Functions

Not all functions can reasonably run on a worker backend. We should not presume that a VOLATILE function can tolerate the unstable execution order imposed by parallelism, though a function like clock_timestamp() is perfectly reasonable to run that way. STABLE does not have that problem, but neither does it constitute a promise that the function implementation is compatible with parallel execution.
Consider xid_age(), which would need code changes to operate correctly in parallel. IMMUTABLE almost guarantees enough; there may come a day when all IMMUTABLE functions can be presumed parallel-safe. For now, an IMMUTABLE function could cause trouble by starting a (read-only) subtransaction. The bottom line is that parallel-compatibility needs to be separate from volatility classes for the time being.

I'm not sure what the specific answer here should look like. Simply having a CREATE FUNCTION ... PARALLEL_IS_FINE flag is not entirely satisfying, because the rules are liable to loosen over time.

* Planner & Similar Issues

We should decide whether to actually sort in parallel based on the comparator cost and the data size. The system currently has no information on comparator cost: bt*cmp (and indeed almost all built-in functions) all have procost=1, but bttextcmp is at least 1000x slower than btint4cmp. Let's improve the procost estimates of all core B-tree and hash operators. This will have other benefits, but we will need to be cognizant of the risk of upsetting setups that have tuned cpu_operator_cost based on the present situation.

The choice of whether to parallelize can probably be made in a manner similar to the choice to do an external sort: the planner guesses the outcome for costing purposes, but the actual decision is made at execution time. The planner would determine a tuple count cutoff at which parallelism becomes favorable, and tuplesort would check that cutoff to establish its actual decision.

Use of parallelism within one process has a distributed effect on the system, similar to the use of work_mem. Add a PGC_USERSET GUC representing the number of worker processes the current session is willing to use. It would default to a small number, say zero or two. On a throughput-oriented system with high concurrent query counts, the DBA would tend to set/leave it at zero in postgresql.conf; parallelism can only help if the system has free resources.
Separately, add a GUC limiting the total number of workers across the system at any one time.

Thanks,
nm

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
Hi,

Interesting! Need to think about most, but one piece immediately came to mind:

On 2013-05-13 10:28:59 -0400, Noah Misch wrote:
> Each worker needs to make SnapshotNow visibility decisions coherent with the
> master. For sorting, this allows us to look up comparison functions, even
> when the current transaction created or modified those functions.

I don't really see how you can achieve that given how SnapshotNow works. There's nothing protecting you against one backend seeing changes made by another transaction while another doesn't see them. SnapshotNow doesn't even guarantee consistency within a single backend during a single scan...

If you are meaning the above to just apply to changes made by the local "master" backend, sure I can see that.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Noah Misch <noah@leadboat.com> writes:
> Each worker needs to make SnapshotNow visibility decisions coherent with the
> master. For sorting, this allows us to look up comparison functions, even
> when the current transaction created or modified those functions. This will
> also be an essential building block for any parallelism project that consults
> user tables. Implementing this means copying the subtransaction stack and the
> combocid hash to each worker.
> [ ... and GUC settings, and who knows what else ... ]

This approach seems to me to be likely to guarantee that the startup overhead for any parallel sort is so large that only fantastically enormous sorts will come out ahead.

I think you need to think in terms of restricting the problem space enough so that the worker startup cost can be trimmed to something reasonable. One obvious suggestion is to forbid the workers from doing any database access of their own at all --- the parent would have to do any required catalog lookups for sort functions etc. before forking the children.

I think we should also seriously think about relying on fork() and copy-on-write semantics to launch worker subprocesses, instead of explicitly copying so much state over to them. Yes, this would foreclose ever having parallel query on Windows, but that's okay with me (hm, now where did I put my asbestos longjohns ...)

Both of these lines of thought suggest that the workers should *not* be full-fledged backends.

			regards, tom lane
On 2013-05-13 10:57:39 -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > Each worker needs to make SnapshotNow visibility decisions coherent with the
> > master. For sorting, this allows us to look up comparison functions, even
> > when the current transaction created or modified those functions. This will
> > also be an essential building block for any parallelism project that consults
> > user tables. Implementing this means copying the subtransaction stack and the
> > combocid hash to each worker.
>
> > [ ... and GUC settings, and who knows what else ... ]
>
> This approach seems to me to be likely to guarantee that the startup
> overhead for any parallel sort is so large that only fantastically
> enormous sorts will come out ahead.

I think if this is the way to go - and I am not sure it is - we need to use a worker pool whose members are (re-)used every time someone needs to do a sort. Which would be easier if backends could switch databases...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, May 13, 2013 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This approach seems to me to be likely to guarantee that the startup
> overhead for any parallel sort is so large that only fantastically
> enormous sorts will come out ahead.
>
> I think you need to think in terms of restricting the problem space
> enough so that the worker startup cost can be trimmed to something
> reasonable. One obvious suggestion is to forbid the workers from
> doing any database access of their own at all --- the parent would
> have to do any required catalog lookups for sort functions etc.
> before forking the children.
>
> I think we should also seriously think about relying on fork() and
> copy-on-write semantics to launch worker subprocesses, instead of
> explicitly copying so much state over to them. Yes, this would
> foreclose ever having parallel query on Windows, but that's okay
> with me (hm, now where did I put my asbestos longjohns ...)
>
> Both of these lines of thought suggest that the workers should *not*
> be full-fledged backends.

Eventually, PostgreSQL needs not only parallel sort, but a more general parallel query facility. The goal here is not to design something specific to parallel sort, but to provide a general infrastructure for server-side parallelism. If we restrict ourselves to a design where syscache lookups aren't possible from a worker backend, I have trouble seeing how that's ever gonna work. That's a very low-level facility that a lot of things rely on. Even if you could make the shuttling of requests between master and slave transparent to the backend code, it would cut into the amount of actually parallelizable work and add very significantly to the overhead.

I don't see any reason to panic about the worker startup cost. I don't know whether the stuff that Noah mentioned copying will take 10 microseconds or 100 milliseconds, but there are plenty of sorts that take many seconds or minutes to complete, so there's still plenty of opportunity for a win there. By definition, the things you want to run in parallel are the ones that take a long time if you don't run them in parallel.

Now, of course, if we can reduce the cost of starting new backends (e.g. by keeping them around from one parallel operation to the next, or by starting them via fork), that will expand the range of cases where parallelism is a win. But I think we could win in plenty of interesting real-world cases even if it took a full second to initialize each new worker, and surely it won't be nearly that bad.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 13, 2013 at 04:39:01PM +0200, Andres Freund wrote:
> On 2013-05-13 10:28:59 -0400, Noah Misch wrote:
> > Each worker needs to make SnapshotNow visibility decisions coherent with the
> > master. For sorting, this allows us to look up comparison functions, even
> > when the current transaction created or modified those functions.
>
> I don't really see how you can achieve that given how SnapshotNow
> works. There's nothing protecting you against one backend seeing changes
> made by another transaction while another doesn't see them. SnapshotNow
> doesn't even guarantee consistency within a single backend during a
> single scan...
>
> If you are meaning the above to just apply to changes made by the local
> "master" backend, sure I can see that.

Yes; it only makes the workers consistent with the master to the extent that the master is consistent with itself. However, your comment makes me see that this may not be enough. For an object not protected by locks, if parallel execution probes the syscache N times where the current code probes it only once, we'll introduce new anomalies. I don't think this affects sorting in particular, because we already make little effort to behave sanely when a comparison function is redefined mid-sort. It seems likely that this will need a better answer sooner or later as we move into the parallelism space.

Thanks,
nm

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
On 13 May 2013 15:28, Noah Misch <noah@leadboat.com> wrote:
> The heavyweight locking mechanism will need to be aware of the association
> between the master and its workers.

Not sure I can see why that would be true.

ISTM that the workers need to be restricted in various ways from a full-functioned master. If the workers can do things that conflict with the master and/or each other, we're going to have all kinds of hurt.

The key is going to be deciding what the restrictions are and then working out how to enforce them.

--
Simon Riggs                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2013/5/13 Noah Misch <noah@leadboat.com>:
> * Planner & Similar Issues
>
> We should decide whether to actually sort in parallel based on the comparator
> cost and the data size. The system currently has no information on comparator
> cost: bt*cmp (and indeed almost all built-in functions) all have procost=1,
> but bttextcmp is at least 1000x slower than btint4cmp. Let's improve the
> procost estimates of all core B-tree and hash operators. This will have other
> benefits, but we will need to be cognizant of the risk of upsetting setups
> that have tuned cpu_operator_cost based on the present situation.
>
> The choice of whether to parallelize can probably be made in a manner similar
> to the choice to do an external sort: the planner guesses the outcome for
> costing purposes, but the actual decision is made at execution time. The
> planner would determine a tuple count cutoff at which parallelism becomes
> favorable, and tuplesort would check that to establish its actual decision.

It probably crosses over with my problem consciousness about off-loading CPU-bound workloads, which I partially tried to implement in the writable foreign table feature. Not only for sorting: I think it may be worthwhile to have the capability to push heavy workloads (like sort, aggregate or complex target-lists) out to external computing resources.

However, I doubt whether the decision to parallelize should be made at execution time rather than at the plan stage. For example, when we have a large enough number of records and a 10-core multiprocessor, the wise plan may take parallel data load by 10 processors, partial sort by 10 processors individually, then a merge-sort. That needs a fundamentally different tree structure from the traditional single-processor plan-tree. So, it seems to me we should take an enhancement that allows injecting special-purpose parallel processing plan nodes into the plan-tree. How about your opinion?

Thanks,
On 13 May 2013 15:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think you need to think in terms of restricting the problem space

+1

> One obvious suggestion is to forbid the workers from
> doing any database access of their own at all --- the parent would
> have to do any required catalog lookups for sort functions etc.

+1

> I think we should also seriously think about relying on fork() and
> copy-on-write semantics to launch worker subprocesses, instead of
> explicitly copying so much state over to them. Yes, this would
> foreclose ever having parallel query on Windows, but that's okay
> with me (hm, now where did I put my asbestos longjohns ...)

If we relied on some kind of inherited state, we could easily make the mistake of relying on something that isn't actually being maintained correctly in the worker. Luckily (?) that is exactly the stance we need to make this work on Windows. Other than that, releasing on Windows in a later release sounds sensible; otherwise we'll just delay the development so much that it will still happen in the "later" timeframe, and the chance of an earlier release on Linux/BSD will be missed.

For example, the idea of managing separate subtransactions in each worker sounds nuts. Impressive, if you're already thinking about parallel DML that can self-recover halfway through a statement and then continue processing, but that's a little advanced. The way to think about this is as a 10-year journey, not as a single feature.

-1 for forking

> Both of these lines of thought suggest that the workers should *not*
> be full-fledged backends.

+1 to the idea of workers != masters

--
Simon Riggs                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, May 13, 2013 at 11:28 PM, Noah Misch <noah@leadboat.com> wrote:
> * Identifying Parallel-Compatible Functions
>
> Not all functions can reasonably run on a worker backend. We should not
> presume that a VOLATILE function can tolerate the unstable execution order
> imposed by parallelism, though a function like clock_timestamp() is perfectly
> reasonable to run that way. STABLE does not have that problem, but neither
> does it constitute a promise that the function implementation is compatible
> with parallel execution. Consider xid_age(), which would need code changes to
> operate correctly in parallel. IMMUTABLE almost guarantees enough; there may
> come a day when all IMMUTABLE functions can be presumed parallel-safe. For
> now, an IMMUTABLE function could cause trouble by starting a (read-only)
> subtransaction. The bottom line is that parallel-compatibility needs to be
> separate from volatility classes for the time being.

I am not sure that this problem is limited to functions only; it extends to all the expressions and clauses of queries that could be shipped to and evaluated on the worker backends when fetching tuples that could be used to accelerate a parallel sort. Let's imagine for example the case of a LIMIT clause that can be used by worker backends to limit the number of tuples to sort as the final result. In some ways, Postgres-XC has faced (and is still facing) similar challenges, and they have been partially solved.

> I'm not sure what the specific answer here should look like. Simply having a
> CREATE FUNCTION ... PARALLEL_IS_FINE flag is not entirely satisfying, because
> the rules are liable to loosen over time.

Having a flag would be enough to control parallelism, but cannot we also determine whether the execution of a function can be shipped safely to a worker based on its volatility alone? Immutable functions are presumably safe, as they do not modify the database state and always give the same result; volatile and stable functions are definitely not safe. For such reasons, it would be better to keep things simple and rely on simple rules to determine whether a given expression can be executed safely on a worker backend.

--
Michael
On Tue, May 14, 2013 at 12:51 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> I'm not sure what the specific answer here should look like. Simply having a
>> CREATE FUNCTION ... PARALLEL_IS_FINE flag is not entirely satisfying, because
>> the rules are liable to loosen over time.
>
> Having a flag would be enough to control parallelism, but cannot we also
> determine if the execution of a function can be shipped safely to a worker
> based on its volatility only? Immutable functions are presumably safe as
> they do not modify the database state and give always the same result;
> volatile and stable functions are definitely not safe. For such reasons, it
> would be better to keep things simple and rely on simple rules to determine
> if a given expression can be executed safely on a backend worker.

In the part of the text you didn't quote, Noah explained why not all immutable functions are parallel-safe.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 13, 2013 at 07:55:16PM +0100, Simon Riggs wrote:
> On 13 May 2013 15:28, Noah Misch <noah@leadboat.com> wrote:
> > The heavyweight locking mechanism will need to be aware of the association
> > between the master and its workers.
>
> Not sure I can see why that would be true.
>
> ISTM that the workers need to be restricted in various ways from a
> full-functioned master. If the workers can do things that conflict
> with the master and/or each other, we're going to have all kinds of
> hurt.
>
> The key is going to be deciding what the restrictions are and then
> working out how to enforce them.

Yes, the specific lock manager needs will depend on what we permit workers to do.

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
On Mon, May 13, 2013 at 09:52:43PM +0200, Kohei KaiGai wrote:
> 2013/5/13 Noah Misch <noah@leadboat.com>:
> > The choice of whether to parallelize can probably be made in a manner
> > similar to the choice to do an external sort: the planner guesses the
> > outcome for costing purposes, but the actual decision is made at execution
> > time. The planner would determine a tuple count cutoff at which parallelism
> > becomes favorable, and tuplesort would check that to establish its actual
> > decision.
>
> It probably crossovers my problem consciousness to off-load CPU bounds
> workloads; that I partially tried to implement on writable foreign table
> feature.
> Not only sorting stuff, I think it may be worthful to have capability to
> push heavy workload (like sort, aggregate or complex target-list) out to
> external computing resources.
> However, I doubt whether the decision to parallelize should be done in
> execution time, rather than plan stage. For example, in case when we
> have enough number of records and 10-core multiprocessor, the wise
> plan may take parallel data load by 10-processors, partial-sort by 10-
> processors individually, then merge-sort. It needs fundamental different
> tree structure from the traditional single-processors based plan-tree.

That's taking a few steps more in the direction of parallel general query; at some point, the planner would definitely become parallel-aware. For the narrower topic of parallel sort, I don't think it's necessary. The node tree supplying the sort can't run in parallel (yet), and the node pulling from the sort won't receive the first tuple until the sort is complete. For the time being, the planner just needs to know enough about the sort node's projected use of parallelism to estimate cost.

> So, it seems to me we should take an enhancement to allow to inject
> plan-tree special purpose parallel processing plan node.
> How about your opinion?

I'm not picturing how, specifically, this new node or class of nodes would be used. Could you elaborate?

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
On Tue, May 14, 2013 at 01:51:42PM +0900, Michael Paquier wrote:
> On Mon, May 13, 2013 at 11:28 PM, Noah Misch <noah@leadboat.com> wrote:
> > * Identifying Parallel-Compatible Functions
> >
> > Not all functions can reasonably run on a worker backend. We should not
> > presume that a VOLATILE function can tolerate the unstable execution order
> > imposed by parallelism, though a function like clock_timestamp() is
> > perfectly reasonable to run that way. STABLE does not have that problem,
> > but neither does it constitute a promise that the function implementation
> > is compatible with parallel execution. Consider xid_age(), which would need
> > code changes to operate correctly in parallel. IMMUTABLE almost guarantees
> > enough; there may come a day when all IMMUTABLE functions can be presumed
> > parallel-safe. For now, an IMMUTABLE function could cause trouble by
> > starting a (read-only) subtransaction. The bottom line is that
> > parallel-compatibility needs to be separate from volatility classes for
> > the time being.
>
> I am not sure that this problem is only limited to functions, but to all
> the expressions and clauses of queries that could be shipped and evaluated
> on the worker backends when fetching tuples that could be used to accelerate
> a parallel sort. Let's imagine for example the case of a LIMIT clause that
> can be used by worker backends to limit the number of tuples to sort as
> final result.

It's true that the same considerations apply to other plan tree constructs; however, every such construct is known at build time, so we can study each one and decide how it fits with parallelism. Since functions are user-definable, it's preferable to reason about classes of functions.

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
On Tue, May 14, 2013 at 11:50 AM, Noah Misch <noah@leadboat.com> wrote:
> That's taking a few steps more in the direction of parallel general query; at
> some point, the planner would definitely become parallel-aware. For the
> narrower topic of parallel sort, I don't think it's necessary. The node tree
> supplying the sort can't run in parallel (yet), and the node pulling from the
> sort won't receive the first tuple until the sort is complete. For the time
> being, the planner just needs to know enough about the sort node's projected
> use of parallelism to estimate cost.

You know what would be a low-hanging fruit that I've been thinking would benefit many of my own queries? "Parallel" sequential scan nodes. Even if there's no real parallelism involved, when a query has to scan the same table at multiple nodes, if it's big, it would be worth parallelizing the scans to turn them into synchronized scans. I have absolutely no idea how this would work easily without forked workers, because the scans might be buried in more complex execution trees. But still, it's worth considering that parallelizing may benefit more than core usage.

If execution nodes could be paused at arbitrary points, a "parallel scan" node could pause one branch that has consumed the circular buffer, letting the other branches consume their part, and thus "parallelize" branch execution. But this would perhaps be more complex than simply forking.
On Tue, May 14, 2013 at 01:51:42PM +0900, Michael Paquier wrote:
> On Mon, May 13, 2013 at 11:28 PM, Noah Misch <noah@leadboat.com> wrote:
>
> > * Identifying Parallel-Compatible Functions
> >
> > Not all functions can reasonably run on a worker backend. We should not
> > presume that a VOLATILE function can tolerate the unstable execution order
> > imposed by parallelism, though a function like clock_timestamp() is
> > perfectly reasonable to run that way. STABLE does not have that problem,
> > but neither does it constitute a promise that the function implementation
> > is compatible with parallel execution. Consider xid_age(), which would
> > need code changes to operate correctly in parallel. IMMUTABLE almost
> > guarantees enough; there may come a day when all IMMUTABLE functions can
> > be presumed parallel-safe. For now, an IMMUTABLE function could cause
> > trouble by starting a (read-only) subtransaction. The bottom line is that
> > parallel-compatibility needs to be separate from volatility classes for
> > the time being.
> >
> I am not sure that this problem is limited to functions; it extends to all
> the expressions and clauses of queries that could be shipped and evaluated
> on the worker backends when fetching tuples that could be used to
> accelerate a parallel sort. Let's imagine for example the case of a LIMIT
> clause that can be used by worker backends to limit the number of tuples
> to sort as the final result.

It's true that the same considerations apply to other plan tree constructs; however, every such construct is known at build time, so we can study each one and decide how it fits with parallelism. Since functions are user-definable, it's preferable to reason about classes of functions.

On Tue, May 14, 2013 at 11:59 PM, Noah Misch <noah@leadboat.com> wrote:

The concept of clause parallelism for backend workers is close to the concept of clause shippability introduced in Postgres-XC. In the case of XC, the equivalent of the master backend is a backend located on a node called a Coordinator that merges and organizes results fetched in parallel from remote nodes where the data scans occur (on nodes called Datanodes). The backends used for tuple scans across Datanodes share the same data visibility, as they use the same snapshot and transaction ID as the backend on the Coordinator. This differs from the proposed parallelism in that there is no notion of importing a snapshot into worker backends.

However, the code in the XC planner used for clause shippability evaluation is definitely worth looking at, considering the many similarities it shares with parallelism when evaluating whether a given clause can be executed on a worker backend or not. It would be a waste to implement the same thing twice if there is code already available.

Yes. You are right.
--
Michael
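Noah's closing point, that we should reason about classes of functions and study each plan construct, amounts to a recursive walk over the expression tree deciding whether everything in a clause is safe to ship to a worker. Here is a toy sketch of such a check; the node layout and the safe-function whitelist are invented purely for illustration, and this is not PostgreSQL planner code:

```python
# Toy model of a parallel-safety check over an expression tree, sketching the
# idea discussed above. Node shapes and the whitelist are invented; in a real
# system the flag would live in the function's catalog entry.

PARALLEL_SAFE_FUNCS = {"int4lt", "texteq", "clock_timestamp"}  # hypothetical flags

def clause_is_parallel_safe(node):
    """Return True only if every function referenced by the clause is known
    parallel-safe; unrecognized constructs are conservatively assumed unsafe."""
    kind = node.get("kind")
    if kind in ("const", "var"):
        return True
    if kind == "funcexpr":
        return (node["funcname"] in PARALLEL_SAFE_FUNCS
                and all(clause_is_parallel_safe(arg) for arg in node["args"]))
    return False  # anything we cannot classify stays on the master

expr = {"kind": "funcexpr", "funcname": "texteq",
        "args": [{"kind": "var"}, {"kind": "const"}]}
print(clause_is_parallel_safe(expr))
```

The conservative default (treat unknown constructs as unsafe) mirrors the thread's conclusion: since the rules are liable to loosen over time, it is safer to whitelist than to blacklist.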
On 05/13/2013 07:10 PM, Robert Haas wrote:
> On Mon, May 13, 2013 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This approach seems to me to be likely to guarantee that the startup
>> overhead for any parallel sort is so large that only fantastically
>> enormous sorts will come out ahead.
>>
>> I think you need to think in terms of restricting the problem space
>> enough so that the worker startup cost can be trimmed to something
>> reasonable. One obvious suggestion is to forbid the workers from
>> doing any database access of their own at all --- the parent would
>> have to do any required catalog lookups for sort functions etc.
>> before forking the children.
>>
>> I think we should also seriously think about relying on fork() and
>> copy-on-write semantics to launch worker subprocesses, instead of
>> explicitly copying so much state over to them. Yes, this would
>> foreclose ever having parallel query on Windows, but that's okay
>> with me (hm, now where did I put my asbestos longjohns ...)
>>
>> Both of these lines of thought suggest that the workers should *not*
>> be full-fledged backends.
>
> Eventually, PostgreSQL needs not only parallel sort, but a more
> general parallel query facility. The goal here is not to design
> something specific to parallel sort, but to provide a general
> infrastructure for server-side parallelism. If we restrict ourselves
> to a design where syscache lookups aren't possible from a worker
> backend, I have trouble seeing how that's ever gonna work. That's a
> very low-level facility that a lot of things rely on. Even if you
> could make the shuttling of requests between master and slave
> transparent to the backend code, it's cutting into the amount of
> actual parallelizable stuff, and adding very significantly to the
> overhead.

Has anybody looked into making syscache MVCC compliant?

A lot of complexity would be avoided if there were some efficient way to have the same MVCC power in syscache as there is in the rest of the system. So instead of solving the cache transactional coherency problems separately in each user of syscache, why not bring the syscache to the level of the rest of postgres?

> I don't see any reason to panic about the worker startup cost. I
> don't know whether the stuff that Noah mentioned copying will take 10
> microseconds or 100 milliseconds, but there are plenty of sorts that
> take large numbers of seconds or minutes to happen, so there's still
> plenty of opportunity for win there. By definition, the things that
> you want to run in parallel are the ones that take a long time if you
> don't run them in parallel. Now, of course, if we can reduce the cost
> of starting new backends (e.g. by keeping them around from one
> parallel operation to the next, or by starting them via fork), that
> will expand the range of cases where parallelism is a win. But I
> think we could win in plenty of interesting real-world cases even if
> it took a full second to initialize each new worker, and surely it
> won't be nearly that bad.

+1 to there being lots of use cases for high startup cost parallelism. There are lots of non-parallel query plans as well which have high startup cost, and the postgresql optimiser already deals with them quite well.

Hannu
Hannu Krosing <hannu@krosing.net> writes:
> Has anybody looked into making syscache MVCC compliant ?

This is the wrong statement of the question. The right statement is "how would you like backend A to be updating table T in compliance with a view of table T's schema that is obsolete because of backend B's already-committed DDL changes?" For example, ignoring a just-committed CHECK constraint because it's not visible to your transaction's snapshot.

The point of SnapshotNow catalog lookups is to be sure that we see the current definition of a table whether or not we're in a transaction whose snapshot precedes the last DDL change to that table. There might be some other way to deal with that consistency problem, but just changing syscache's behavior is not going to make things better.

			regards, tom lane
On 05/15/2013 07:30 AM, Tom Lane wrote:
> Hannu Krosing <hannu@krosing.net> writes:
>> Has anybody looked into making syscache MVCC compliant ?
> This is the wrong statement of the question.
>
> The right statement is "how would you like backend A to be updating
> table T in compliance with a view of table T's schema that is obsolete
> because of backend B's already-committed DDL changes?"

Could we not treat it the same way we treat updated rows in serializable transaction mode? That is, roll back and let the client retry if it so wishes?

> For example, ignoring a just-committed CHECK constraint because it's
> not visible to your transaction's snapshot.

Is there anything in the SQL standard that tells us to handle schema changes in parallel to DML in any other way than by rollback? I agree it is nice to have a way to carry on in this case, but there is only so much you can sensibly do. For example, I can see no good way to handle a parallel DROP COLUMN, especially if you are trying to update that same column based on your old snapshot.

> The point of SnapshotNow catalog lookups is to be sure that we see the
> current definition of a table whether or not we're in a transaction
> whose snapshot precedes the last DDL change to that table.

Can't we just detect that "there have been changes" and roll back if trying any DML on changed tables? It does not seem like something that happens too often.

> There might
> be some other way to deal with that consistency problem, but just
> changing syscache's behavior is not going to make things better.

I was targeting mainly the read-only case of querying data from a changed table. I have not checked how SnapshotNow catalog lookups handle this, but I guess it is not trivial in case there have been changes to the catalog since the active snapshot was taken.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
On Monday, May 13, 2013 7:59 PM Noah Misch wrote: > It would be great if one client session could take advantage of > multiple CPU > cores. EnterpriseDB wishes to start the trek into this problem space > for 9.4 > by implementing parallel internal (i.e. not spilling to disk) sort. > This > touches on a notable subset of the infrastructure components we'll need > for > parallel general query. My intent is to map out the key design topics, > hear > about critical topics I hadn't considered, and solicit feedback on the > quality > of the high-level plan. Full designs for key pieces will come later. > > > * Worker Backends > > A multi-process model, rather than a multi-thread, is best for > introducing > parallelism to a PostgreSQL session. With threading, all in-memory > state > would become shared by default; with processes, in-memory data > structures > remain unshared until we explicitly select them for sharing. Some data > structures will require new sharing, but I expect unshared ones to > remain the > typical case. I refer to these additional processes as worker > backends. They > will have much in common with ordinary client backends, including a > database > connection and a PGPROC. They will have no client socket; instead, > they will > take direction from the client-connected backend, termed the master, > via > shared memory. > We can allocate a small amount of permanent shared memory for > coordination > among a group of processes, but sorting will benefit from a region as > large as > maintenance_work_mem. Expect on-demand memory sharing. Will the shared memory used for coordinating tuples between master and worker be fixed or varying depending on size of tuples to be sorted or number of workers associated. 
If it is varying, then it can sometimes encounter a situation where the required memory is not available, in which case it has to revert to serial sorting.

> > * Identifying Parallel-Compatible Functions
> >
> > Not all functions can reasonably run on a worker backend. We should not
> > presume that a VOLATILE function can tolerate the unstable execution
> > order imposed by parallelism, though a function like clock_timestamp()
> > is perfectly reasonable to run that way. STABLE does not have that
> > problem, but neither does it constitute a promise that the function
> > implementation is compatible with parallel execution. Consider
> > xid_age(), which would need code changes to operate correctly in
> > parallel. IMMUTABLE almost guarantees enough; there may come a day when
> > all IMMUTABLE functions can be presumed parallel-safe. For now, an
> > IMMUTABLE function could cause trouble by starting a (read-only)
> > subtransaction. The bottom line is that parallel-compatibility needs
> > to be separate from volatility classes for the time being.
> >
> > I'm not sure what the specific answer here should look like. Simply
> > having a CREATE FUNCTION ... PARALLEL_IS_FINE flag is not entirely
> > satisfying, because the rules are liable to loosen over time.
> >
> > * Planner & Similar Issues
> >
> > We should decide whether to actually sort in parallel based on the
> > comparator cost and the data size. The system currently has no
> > information on comparator cost: bt*cmp (and indeed almost all built-in
> > functions) all have procost=1, but bttextcmp is at least 1000x slower
> > than btint4cmp. Let's improve the procost estimates of all core B-tree
> > and hash operators. This will have other benefits, but we will need to
> > be cognizant of the risk of upsetting setups that have tuned
> > cpu_operator_cost based on the present situation.
> > The choice of whether to parallelize can probably be made in a manner
> > similar to the choice to do an external sort: the planner guesses the
> > outcome for costing purposes, but the actual decision is made at
> > execution time. The planner would determine a tuple count cutoff at
> > which parallelism becomes favorable, and tuplesort would check that to
> > establish its actual decision.
> >
> > Use of parallelism within one process has a distributed effect on the
> > system, similar to the use of work_mem. Add a PGC_USERSET GUC
> > representing the number of worker processes the current session is
> > willing to use. It would default to a small number, say zero or two.
> > On a throughput-oriented system with high concurrent query counts, the
> > DBA would tend to set/leave it at zero in postgresql.conf; parallelism
> > can only help if the system has free resources. Separately, add a GUC
> > limiting the total number of workers across the system at any one time.

How will the parallel sorting tasks be divided and assigned to each worker?

With Regards,
Amit Kapila.
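The cutoff idea quoted above (the planner picks a tuple count at which parallelism becomes favorable) can be caricatured with a toy cost model. Every constant and formula below is invented for illustration; this is not PostgreSQL's costing code:

```python
import math

# Back-of-envelope model of the "tuple count cutoff": the parallel plan wins
# once the comparator work saved across workers exceeds a fixed worker startup
# overhead. All numbers are made up for illustration.

def serial_cost(n, cmp_cost):
    return n * math.log2(max(n, 2)) * cmp_cost

def parallel_cost(n, cmp_cost, workers, startup):
    # Idealized: comparison work divides evenly; startup is paid once.
    return serial_cost(n, cmp_cost) / workers + startup

def tuple_cutoff(cmp_cost, workers, startup):
    """Smallest power-of-two n at which the parallel plan is cheaper."""
    n = 2
    while parallel_cost(n, cmp_cost, workers, startup) >= serial_cost(n, cmp_cost):
        n *= 2
    return n

# A comparator that is 1000x more expensive (bttextcmp vs. btint4cmp, per the
# thread) pulls the break-even tuple count far lower:
print(tuple_cutoff(cmp_cost=1.0, workers=4, startup=1e6))
print(tuple_cutoff(cmp_cost=1000.0, workers=4, startup=1e6))
```

With these made-up numbers, the expensive comparator reaches break-even at a tuple count several orders of magnitude smaller, which is exactly why the quoted text wants accurate procost estimates before making the call.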
On Tue, May 14, 2013 at 12:15:24PM -0300, Claudio Freire wrote: > You know what would be a low-hanging fruit that I've been thinking > would benefit many of my own queries? > > "Parallel" sequential scan nodes. Even if there's no real parallelism > involved, when a query has to scan the same table at multiple nodes, > if it's big, it would be worth parallelizing the scans to transform > them into synchro scans. > > I have absolutely no idea how this would work easily without forked > workers, because the scans might be buried in more complex execution > trees. But still, it's worth considering, that parallelizing may > benefit more than core usage. > > If execution nodes could be paused at arbitrary points, a "parallel > scan" node could pause one branch that has consumed the circular > buffer, letting another branches consume their part, and thus > "parallelizing" branch execution. But this would be perhaps more > complex than simply forking. Execution nodes do pause between every output tuple, at least nominally. Still, given the architecture of our executor and the planner work to implement such a thing, I would not classify it as low-hanging fruit. It would primarily apply to a plan with independent sequential scans of the same large (relative to total memory) relation. I'm sure that comes up, but it doesn't strike me as typical. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
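Noah's remark that "execution nodes do pause between every output tuple, at least nominally" refers to the pull-model executor, where each node yields one tuple at a time to its parent. A minimal sketch of that control flow using Python generators (purely illustrative; PostgreSQL's executor is C and works tuple-slot by tuple-slot):

```python
# Minimal pull-model ("Volcano"-style) executor sketch: each node is a
# generator that produces one tuple and then pauses until its parent demands
# the next. Node names and row layout are invented for illustration.

def seq_scan(table):
    for row in table:
        yield row  # the node suspends here between output tuples

def filter_node(child, predicate):
    for row in child:
        if predicate(row):
            yield row

def limit_node(child, n):
    for i, row in enumerate(child):
        if i >= n:
            break
        yield row

table = [{"id": i} for i in range(100)]
plan = limit_node(filter_node(seq_scan(table), lambda r: r["id"] % 2 == 0), 3)
print(list(plan))
```

Because control returns to the top of the plan after every tuple, pausing "between tuples" is natural; what Claudio describes, pausing a branch mid-stream without emitting a tuple, is the part our executor protocol does not directly support.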
On Wed, May 15, 2013 at 08:12:34AM +0900, Michael Paquier wrote: > The concept of clause parallelism for backend worker is close to the > concept of clause shippability introduced in Postgres-XC. In the case of > XC, the equivalent of the master backend is a backend located on a node > called Coordinator that merges and organizes results fetched in parallel > from remote nodes where data scans occur (on nodes called Datanodes). The > backends used for tuple scans across Datanodes share the same data > visibility as they use the same snapshot and transaction ID as the backend > on Coordinator. This is different from the parallelism as there is no idea > of snapshot import to worker backends. Worker backends would indeed share snapshot and XID. > However, the code in XC planner used for clause shippability evaluation is > definitely worth looking at just considering the many similarities it > shares with parallelism when evaluating if a given clause can be executed > on a worker backend or not. It would be a waste to implement twice the same > thing is there is code already available. Agreed. Local parallel query is very similar to distributed query; the specific IPC cost multipliers differ, but that's about it. I hope we can benefit from XC's experience in this area. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On Wed, May 15, 2013 at 12:26:52PM +0530, Amit Kapila wrote: > On Monday, May 13, 2013 7:59 PM Noah Misch wrote: > > We can allocate a small amount of permanent shared memory for > > coordination > > among a group of processes, but sorting will benefit from a region as > > large as > > maintenance_work_mem. Expect on-demand memory sharing. > > Will the shared memory used for coordinating tuples between master and > worker be fixed or varying depending on size of tuples to be sorted or > number of workers associated. > If it is varying, then it can sometimes encounter situation where required > memory is not available and in that case it has to revert to serial sorting > How will the parallel sorting tasks be divided and assigned to each worker? I haven't selected answers for those details, yet. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On Mon, May 13, 2013 at 7:28 AM, Noah Misch <noah@leadboat.com> wrote:
> We should decide whether to actually sort in parallel based on the comparator
> cost and the data size. The system currently has no information on comparator
> cost: bt*cmp (and indeed almost all built-in functions) all have procost=1,
> but bttextcmp is at least 1000x slower than btint4cmp.

I think that this effort could justify itself independently of any attempt to introduce parallelism to in-memory sorting. I abandoned a patch to introduce timsort to Postgres, because I knew that there was no principled way to reap the benefits. Unless you introduce parallelism, it's probably going to be virtually impossible to come up with an algorithm that does in-memory sorting faster (in terms of the amount of system time taken) than a highly optimized quicksort when sorting integers. But sorting types with really expensive comparators (even considerably more expensive than bttextcmp) for pass-by-reference Datums (where the memory locality advantage of quicksort doesn't really help so much) makes timsort much more compelling. That's why it's used for Python lists.

--
Peter Geoghegan
On 2013-05-13 10:28:59 -0400, Noah Misch wrote:
> Each worker needs to make SnapshotNow visibility decisions coherent with the
> master. For sorting, this allows us to look up comparison functions, even
> when the current transaction created or modified those functions. This will
> also be an essential building block for any parallelism project that consults
> user tables. Implementing this means copying the subtransaction stack and the
> combocid hash to each worker. For the sake of completeness, we should also
> copy the global MVCC snapshot data (sorting probably won't care). It also
> means forbidding, while a parallel task is in flight, operations that affect
> the transaction state:

Btw, if you assume you can simply copy a snapshot from the normal backend to the worker backend to make visibility decisions in the general case: You're wrong. Unfortunately you need in-memory state to make sense of combocids... Not impossible to solve, but you should be aware of the issue.

Greetings,

Andres Freund

--
Andres Freund	                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 15, 2013 at 11:32 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I think that this effort could justify itself independently of any
> attempt to introduce parallelism to in-memory sorting. I abandoned a
> patch to introduce timsort to Postgres, because I knew that there was
> no principled way to reap the benefits.

Just for the record, I attach a patch that introduces a timsort_arg function as a drop-in replacement for quicksort_arg (including replacing all of the specializations that went into 9.2). It has been rebased against master. For what it's worth, if anyone wanted to pick this up, that would be fine with me.

Don't be fooled by the superficial regression test failures. The tests in question are subtly wrong, because they rely on a certain ordering that isn't explicitly requested. Timsort is stable, whereas quicksort generally isn't stable (our implementation certainly isn't).

--
Peter Geoghegan
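The regression-test subtlety Peter describes comes down to stability. A quick illustration with Python's sorted(), which is stable because it is a timsort: equal keys keep their input order, so a test whose expected output silently depends on how an unstable quicksort happened to order ties will "fail" under a stable sort even though both orders are legal answers to the query.

```python
# Stability demo: a stable sort preserves the relative input order of rows
# that compare equal on the sort key. Row values are invented for illustration.

rows = [("b", 2), ("a", 1), ("b", 1), ("a", 2)]
by_key = sorted(rows, key=lambda r: r[0])

# Ties within "a" and within "b" keep their original relative order:
print(by_key)
```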
On Wed, May 15, 2013 at 3:04 PM, Noah Misch <noah@leadboat.com> wrote: > On Tue, May 14, 2013 at 12:15:24PM -0300, Claudio Freire wrote: >> You know what would be a low-hanging fruit that I've been thinking >> would benefit many of my own queries? >> >> "Parallel" sequential scan nodes. Even if there's no real parallelism >> involved, when a query has to scan the same table at multiple nodes, >> if it's big, it would be worth parallelizing the scans to transform >> them into synchro scans. >> >> I have absolutely no idea how this would work easily without forked >> workers, because the scans might be buried in more complex execution >> trees. But still, it's worth considering, that parallelizing may >> benefit more than core usage. >> >> If execution nodes could be paused at arbitrary points, a "parallel >> scan" node could pause one branch that has consumed the circular >> buffer, letting another branches consume their part, and thus >> "parallelizing" branch execution. But this would be perhaps more >> complex than simply forking. > > Execution nodes do pause between every output tuple, at least nominally. > Still, given the architecture of our executor and the planner work to > implement such a thing, I would not classify it as low-hanging fruit. It > would primarily apply to a plan with independent sequential scans of the same > large (relative to total memory) relation. I'm sure that comes up, but it > doesn't strike me as typical. I found it rather typical of some of my workloads, but it could probably not be the case globally. It would be rather easier if it could pause without returning rows. I think ATM, not returning any rows means the node is finished doing its scan. The nodes that would have to be pausable like this wouldn't be sequential scans, but sorts, hashes, and in general those that take a long time to start returning rows. 
So, a plan that goes like:

    Seq on A -> Sort -> Merge -> result
    Seq on A -> Sort --/

Would be turned into:

    Seq on A -> Step Sort -> Parallel Merge -> result
    Seq on A -> Step Sort --/

Or even maybe:

    Seq on A -> Sort -> Tee X -> Parallel Merge
                            X --/

I think Tee and Parallel Merge should be doable with current infrastructure, because they don't require pausing without returning any tuples. Not sure how many meters above ground that is, or how many gotchas might be involved. But it's been circling in my head for a while.
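For what it's worth, the Tee and Merge dataflow sketched above has close standard-library analogues: itertools.tee buffers one iterator for two consumers, and heapq.merge combines pre-sorted streams. A single-process sketch, so it shows only the plumbing, not actual parallel execution:

```python
import heapq
import itertools

# Dataflow sketch of the "Tee X -> Parallel Merge" shape: sort once, feed two
# consumers from the Tee, then merge the (still sorted) branches. The filters
# are invented just to make the merge non-trivial.

def seq_scan(table):
    yield from table

sorted_a = iter(sorted(seq_scan([5, 1, 4, 2, 3])))  # "Seq on A -> Sort"
branch1, branch2 = itertools.tee(sorted_a)          # "Tee X": one sort, two consumers

evens = (v for v in branch1 if v % 2 == 0)
odds = (v for v in branch2 if v % 2 == 1)

# "Merge" over the two sorted branches:
print(list(heapq.merge(evens, odds)))
```

itertools.tee also illustrates Claudio's buffering concern: if one consumer runs far ahead of the other, the tee's internal buffer grows, which is the circular-buffer exhaustion problem he describes.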
On Wed, May 15, 2013 at 08:49:00PM +0200, Andres Freund wrote: > On 2013-05-13 10:28:59 -0400, Noah Misch wrote: > > Each worker needs to make SnapshotNow visibility decisions coherent with the > > master. For sorting, this allows us to look up comparison functions, even > > when the current transaction created or modified those functions. This will > > also be an essential building block for any parallelism project that consults > > user tables. Implementing this means copying the subtransaction stack and the > > combocid hash to each worker. For the sake of completeness, we should also > > copy the global MVCC snapshot data (sorting probably won't care). It also > > means forbidding, while a parallel task is in flight, operations that affect > > the transaction state: > > Btw, if you assume you can simply copy a snapshot from the normal > backend to the worker backend to make visibility decisions in the > general case: You're wrong. Unfortunately you need in-memory state to > make sense of combocids... Correct. If you think of any required state information that I did not list above, please let me know. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
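Andres's caveat can be pictured with a toy model of the combo command ID mechanism (heavily simplified from PostgreSQL's combocid.c; class and field names are invented): the tuple header stores only an index into a backend-local (cmin, cmax) array, so a worker that copies the snapshot but not that array cannot decode the tuple.

```python
# Toy model of combo command IDs: when one transaction both inserts and
# deletes a tuple, cmin and cmax must share a single on-tuple field, so the
# backend stores an index into a private array instead. Simplified for
# illustration; not PostgreSQL's actual data structures.

class Backend:
    def __init__(self):
        self.combocids = []  # backend-local state, invisible to other processes

    def make_combo(self, cmin, cmax):
        self.combocids.append((cmin, cmax))
        return len(self.combocids) - 1  # this index is what the tuple carries

    def decode(self, combo_id):
        return self.combocids[combo_id]

master = Backend()
tuple_field = master.make_combo(cmin=3, cmax=7)
print(master.decode(tuple_field))

worker = Backend()  # fresh worker: snapshot copied, combocid array not
try:
    worker.decode(tuple_field)
except IndexError:
    print("worker cannot interpret the combo CID without the master's array")
```

Hence the plan in the original mail to copy the combocid hash to each worker along with the snapshot.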
On Wed, May 15, 2013 at 11:11 AM, Noah Misch <noah@leadboat.com> wrote:
On Wed, May 15, 2013 at 08:12:34AM +0900, Michael Paquier wrote:Worker backends would indeed share snapshot and XID.
> The concept of clause parallelism for backend worker is close to the
> concept of clause shippability introduced in Postgres-XC. In the case of
> XC, the equivalent of the master backend is a backend located on a node
> called Coordinator that merges and organizes results fetched in parallel
> from remote nodes where data scans occur (on nodes called Datanodes). The
> backends used for tuple scans across Datanodes share the same data
> visibility as they use the same snapshot and transaction ID as the backend
> on Coordinator. This is different from the parallelism as there is no idea
> of snapshot import to worker backends.Agreed. Local parallel query is very similar to distributed query; the
> However, the code in XC planner used for clause shippability evaluation is
> definitely worth looking at just considering the many similarities it
> shares with parallelism when evaluating if a given clause can be executed
> on a worker backend or not. It would be a waste to implement twice the same
> thing is there is code already available.
specific IPC cost multipliers differ, but that's about it. I hope we can
benefit from XC's experience in this area.
I believe parallel execution is much easier to do if the data is partitioned. Of course it is possible to make only the sort operation parallel, but then the question would be how to split and pass each tuple to the workers. XC and Greenplum use the notion of a hash-distributed table that enables parallel sort (XC doesn't perform parallel sort on replicated tables, I guess). For postgres, I don't think a hash-distributed table is a foreseeable option, but MergeAppend over inheritance is a good choice to run in parallel. You won't even need to modify many lines of the sort execution code if you correctly dispatch the work, as it's just a matter of splitting and assigning the subnodes of the query plan to workers.

Transactions and locks will be tricky, though, and we might end up introducing a small set of snapshot-sharing infrastructure for the former and a notion of session id rather than process id for the latter. I don't think SnapshotNow is the problem, as the executor is reading catalogs with it today anyway.
Thanks,
Hitoshi
--
Hitoshi Harada
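Hitoshi's "split and assign the subnodes" shape, sort each partition in a worker and then MergeAppend the sorted runs, can be sketched in a few lines. A thread pool stands in for worker backends purely for illustration; the thread above is explicitly about worker processes, and thread-based sorting would not actually parallelize in CPython.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Sketch of "partition -> sort each part in a worker -> merge", the
# MergeAppend-over-partitions shape described above. The pool is a stand-in
# for worker backends; chunking is by position rather than by hash for
# simplicity.

def parallel_sort(data, workers=4):
    chunk = max(1, len(data) // workers)
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_parts = list(pool.map(sorted, parts))  # each "worker" sorts one partition
    return list(heapq.merge(*sorted_parts))           # master merges the sorted runs

data = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
print(parallel_sort(data))
```

The merge step is cheap relative to the per-partition sorts, which is why dispatching subnodes to workers leaves the sort code itself largely untouched.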
On 5/13/13 9:28 AM, Noah Misch wrote:
> It would be great if one client session could take advantage of multiple CPU
> cores. EnterpriseDB wishes to start the trek into this problem space for 9.4
> by implementing parallel internal (i.e. not spilling to disk) sort. This
> touches on a notable subset of the infrastructure components we'll need for
> parallel general query. My intent is to map out the key design topics, hear
> about critical topics I hadn't considered, and solicit feedback on the quality
> of the high-level plan. Full designs for key pieces will come later.

Have you considered GPU-based sorting? I know there's been discussion in the past.

To me, the biggest advantage of GPU sorting is that most of the concerns you've laid out go away; a backend that needs to sort just throws data at the GPU to do the actual sorting; all the MVCC issues and what not remain within the scope of a single backend.

--
Jim C. Nasby, Data Architect
jim@nasby.net
512.569.9461 (cell)
http://jim.nasby.net
> Have you considered GPU-based sorting? I know there's been discussion in the past. If you use OpenCL, then you can use a CPU driver if there is no GPU, and that can allow you to leverage all the CPU cores without having to do the multi-thread stuff in the backend. While the compilation of a specific kernel can be quite expensive, it also has the effect of a JIT compiler in terms of system independence.
Let me introduce one thing we discussed in the developer meeting at Ottawa. We got a consensus that a pluggable exec-node may be useful to replace a part of the exec-node tree with an alternative one implemented by an extension, which would allow running something like "GpuSort" instead of the existing Sort.

http://wiki.postgresql.org/wiki/PgCon_2013_Developer_Meeting#Pluggable_plan.2Fexec_nodes

2013/5/24 james <james@mansionfamily.plus.com>:
>> Have you considered GPU-based sorting? I know there's been discussion in
>> the past.
>
> If you use OpenCL, then you can use a CPU driver if there is no GPU, and
> that can allow you to leverage all the CPU cores without having to do the
> multi-thread stuff in the backend.
>
> While the compilation of a specific kernel can be quite expensive, it also
> has the effect of a JIT compiler in terms of system independence.

--
KaiGai Kohei <kaigai@kaigai.gr.jp>
On Fri, May 24, 2013 at 01:13:21PM -0500, Jim Nasby wrote:
> On 5/13/13 9:28 AM, Noah Misch wrote:
>> It would be great if one client session could take advantage of multiple
>> CPU cores. EnterpriseDB wishes to start the trek into this problem space
>> for 9.4 by implementing parallel internal (i.e. not spilling to disk)
>> sort. This touches on a notable subset of the infrastructure components
>> we'll need for parallel general query. My intent is to map out the key
>> design topics, hear about critical topics I hadn't considered, and
>> solicit feedback on the quality of the high-level plan. Full designs for
>> key pieces will come later.
>
> Have you considered GPU-based sorting? I know there's been discussion in
> the past.

I had considered it briefly. Parallel sort is mainly valuable for expensive comparison operators. Sorting int4, for example, is too cheap for parallelism to be compelling. (In my test build of a 16 GiB int4 index, sorting took 11s of the 391s build time.) However, expensive operators are also liable to be difficult to reimplement for the GPU. In particular, implementing a GPU-based strcoll() for bttextcmp sounds like quite a project in its own right.

> To me, the biggest advantage of GPU sorting is that most of the concerns
> you've laid out go away; a backend that needs to sort just throws data at
> the GPU to do the actual sorting; all the MVCC issues and what not remain
> within the scope of a single backend.

Those are matters we would eventually need to address as we parallelize more things, so I regard confronting them as an advantage. Among other benefits, this project is a vehicle for emplacing some infrastructure without inviting the full complexity entailed by loftier goals.

Thanks,
nm

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
* Noah Misch (noah@leadboat.com) wrote:
> In particular, implementing a GPU-based strcoll() for bttextcmp
> sounds like quite a project in its own right.

It also wouldn't likely be helpful... To be able to use a GPU effectively, last I looked, you need to be able to move a large chunk of data to the GPU's memory, operate on it, and then bulk move the results back, because the cost of moving data between main memory and GPU memory is very high.

> Those are matters we would eventually need to address as we parallelize more
> things, so I regard confronting them as an advantage. Among other benefits,
> this project is a vehicle for emplacing some infrastructure without inviting
> the full complexity entailed by loftier goals.

Agreed.

Thanks,

Stephen