Re: parallelizing subplan execution - Mailing list pgsql-hackers
From | Mark Kirkwood
---|---
Subject | Re: parallelizing subplan execution
Date |
Msg-id | 4B80E2D1.2040307@catalyst.net.nz
In response to | Re: parallelizing subplan execution (was: explain and PARAM_EXEC) (Robert Haas <robertmhaas@gmail.com>)
List | pgsql-hackers
Robert Haas wrote:
>
> It seems to me that you need to start by thinking about what kinds of
> queries could be usefully parallelized. What I think you're proposing
> here, modulo large amounts of hand-waving, is that we should basically
> find a branch of the query tree, cut it off, and make that branch the
> responsibility of a subprocess. What kinds of things would be sensible
> to hand off in this way? Well, you'd want to find nodes that are not
> likely to be repeatedly re-executed with different parameters, like
> subplans or inner-indexscans, because otherwise you'll get pipeline
> stalls handing the new parameters back and forth. And you want to find
> nodes that are expensive for the same reason. So maybe this would work
> for something like a merge join on top of two sorts - one backend
> could perform each sort, and then whichever one was the child would
> stream the tuples to the parent for the final merge. Of course, this
> assumes the I/O subsystem can keep up, which is not a given - if both
> tables are fed by the same, single spindle, it might be worse than if
> you just did the sorts consecutively.
>
> This approach might also benefit queries that are very CPU-intensive
> on a multi-core system with spare cycles. Suppose you have a big tall
> stack of hash joins, each with a small inner rel. The child process
> does about half the joins and then pipelines the results into the
> parent, which does the other half and returns the results.
>
> But there's at least one other totally different way of thinking about
> this problem, which is that you might want two processes to cooperate
> in executing the SAME query node - imagine, for example, a big
> sequential scan with an expensive but highly selective filter
> condition, or an enormous sort. You have all the same problems of
> figuring out when it's actually going to help, of course, but the
> details will likely be quite different.
>
> I'm not really sure which one of these would be more useful in
> practice - or maybe there are even other strategies. What does
> $COMPETITOR do?
>
> I'm also ignoring the difficulties of getting hold of a second backend
> in the right state - same database, same snapshot, etc. It seems to me
> unlikely that there are a substantial number of real-world
> applications for which this will work very well if we have to actually
> start a new backend every time we want to parallelize a query. IOW,
> we're going to need, well, a connection pool in core. *ducks, runs for
> cover*

One thing that might work quite well is slicing up by partition
(properly implemented partitioning would go along with this nicely
too...)

regards

Mark
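To make the plan-branch handoff idea above a bit more concrete, here is a
minimal toy sketch of a merge join over two sorts where one sort runs in a
child process and streams its output back to the parent. It is plain Python
using multiprocessing, not PostgreSQL executor code; the sample relations,
the join key, and the unique-right-key simplification are all invented for
illustration.

```python
from multiprocessing import Process, Queue

SENTINEL = None  # marks the end of the child's sorted stream


def sort_worker(rows, out):
    """Child-backend stand-in: sort one input relation and stream it to the parent."""
    for row in sorted(rows, key=lambda r: r[0]):
        out.put(row)
    out.put(SENTINEL)


def merge_join(left_sorted, right_queue):
    """Parent: merge-join its own sorted input with the stream arriving from the child.

    Simplified: assumes the join key is unique on the right side.
    """
    def drain(q):
        while True:
            row = q.get()
            if row is SENTINEL:
                return
            yield row

    right_iter = drain(right_queue)
    right_row = next(right_iter, None)
    for left_row in left_sorted:
        # advance the right stream until its key catches up with the left key
        while right_row is not None and right_row[0] < left_row[0]:
            right_row = next(right_iter, None)
        if right_row is not None and right_row[0] == left_row[0]:
            yield left_row, right_row


if __name__ == "__main__":
    left = [(3, "a"), (1, "b"), (2, "c")]    # pretend relation sorted by the parent
    right = [(2, "x"), (3, "y"), (4, "z")]   # pretend relation handed off to the child
    q = Queue()
    child = Process(target=sort_worker, args=(right, q))
    child.start()
    parent_sorted = sorted(left, key=lambda r: r[0])  # parent sorts its side meanwhile
    for pair in merge_join(parent_sorted, q):
        print(pair)
    child.join()
```

The same caveat Robert raises applies to the sketch: the win depends on the
two sorts actually being able to read their inputs concurrently, which a
single spindle may not allow.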
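And a similarly toy sketch of the other strategy, several workers cooperating
on the same scan node by each taking a slice of the input, which is also
roughly what slicing up by partition would look like. Again this is only
illustrative Python; the partitions and the expensive_filter predicate are
made up.

```python
from multiprocessing import Pool


def expensive_filter(row):
    # stand-in for an expensive, highly selective qual
    return (row[0] * row[1]) % 9973 == 0


def scan_slice(rows):
    """One worker's share of the scan: apply the filter to its slice only."""
    return [row for row in rows if expensive_filter(row)]


if __name__ == "__main__":
    # pretend partitions (or block ranges) of one large table
    partitions = [[(i, i + 1) for i in range(p, 200_000, 4)] for p in range(4)]
    with Pool(processes=4) as pool:
        # each worker scans and filters one slice; the parent just gathers results
        results = pool.map(scan_slice, partitions)
    matching = [row for part in results for row in part]
    print(len(matching), "rows passed the filter")
```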