Re: parallelizing subplan execution - Mailing list pgsql-hackers

From: Mark Kirkwood
Subject: Re: parallelizing subplan execution
Msg-id: 4B80E2D1.2040307@catalyst.net.nz
In response to: Re: parallelizing subplan execution (was: explain and PARAM_EXEC) (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Robert Haas wrote:
>
>
> It seems to me that you need to start by thinking about what kinds of
> queries could be usefully parallelized.  What I think you're proposing
> here, modulo large amounts of hand-waving, is that we should basically
> find a branch of the query tree, cut it off, and make that branch the
> responsibility of a subprocess.  What kinds of things would be
> sensible to hand off in this way?  Well, you'd want to find nodes that
> are not likely to be repeatedly re-executed with different parameters,
> like subplans or inner-indexscans, because otherwise you'll get
> pipeline stalls handing the new parameters back and forth.  And you
> want to find nodes that are expensive for the same reason.  So maybe
> this would work for something like a merge join on top of two sorts -
> one backend could perform each sort, and then whichever one was the
> child would stream the tuples to the parent for the final merge.  Of
> course, this assumes the I/O subsystem can keep up, which is not a
> given - if both tables are fed by the same single spindle, it might
> be worse than if you just did the sorts consecutively.
>
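To make the shape of that concrete, here is a toy sketch in Python rather
than PostgreSQL internals (the sample rows, the queue protocol and the
function names are all invented for illustration): the child process sorts
one input and streams it upward, while the parent sorts its other input
locally and performs the merge join as the child's tuples arrive.

import multiprocessing as mp

SENTINEL = None  # marks the end of the child's sorted stream

def sort_worker(rows, out_queue):
    """Child 'backend': sort its branch of the plan and stream tuples up."""
    for row in sorted(rows, key=lambda r: r[0]):
        out_queue.put(row)
    out_queue.put(SENTINEL)

def merge_join(sorted_outer, inner_queue):
    """Parent: merge-join its own sorted input with the child's stream."""
    results = []
    inner = inner_queue.get()
    for okey, oval in sorted_outer:
        # advance the inner stream past keys smaller than the outer key
        while inner is not SENTINEL and inner[0] < okey:
            inner = inner_queue.get()
        # emit a match on equal keys (assumes unique inner keys, for brevity)
        if inner is not SENTINEL and inner[0] == okey:
            results.append((okey, oval, inner[1]))
    return results

if __name__ == "__main__":
    outer = [(3, "c"), (1, "a"), (2, "b")]
    inner = [(2, "x"), (3, "y"), (4, "z")]

    q = mp.Queue()
    child = mp.Process(target=sort_worker, args=(inner, q))
    child.start()

    joined = merge_join(sorted(outer, key=lambda r: r[0]), q)
    child.join()
    print(joined)  # [(2, 'b', 'x'), (3, 'c', 'y')]

Because the child only ever streams upward and never needs fresh parameters
from the parent, there is none of the back-and-forth that would cause the
pipeline stalls you mention.
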
> This approach might also benefit queries that are very CPU-intensive,
> on a multi-core system with spare cycles.  Suppose you have a big tall
> stack of hash joins, each with a small inner rel.  The child process
> does about half the joins and then pipelines the results into the
> parent, which does the other half and returns the results.
>
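The same pattern, sketched for the hash-join stack (again just illustrative
Python with made-up fact and dimension tables, not executor code): the child
runs the lower half of the joins against its small hash tables and pipelines
the intermediate rows to the parent, which finishes the remaining joins as
the rows arrive.

import multiprocessing as mp

SENTINEL = None

def hash_join(rows, dim, key_idx):
    """Join each row against a small in-memory hash table, appending the payload."""
    table = dict(dim)                      # build side: {key: payload}
    for row in rows:
        payload = table.get(row[key_idx])
        if payload is not None:
            yield row + (payload,)

def child_half(outer, dims, out_q):
    """Child: run the lower half of the join stack, stream results upward."""
    rows = iter(outer)
    for key_idx, dim in dims:
        rows = hash_join(rows, dim, key_idx)
    for row in rows:
        out_q.put(row)
    out_q.put(SENTINEL)

def parent_half(in_q, dims):
    """Parent: finish the remaining joins on rows as they arrive."""
    def stream():
        while True:
            row = in_q.get()
            if row is SENTINEL:
                return
            yield row
    rows = stream()
    for key_idx, dim in dims:
        rows = hash_join(rows, dim, key_idx)
    return list(rows)

if __name__ == "__main__":
    fact = [(1, 10), (2, 20), (3, 30)]           # (a_id, b_id)
    dim_a = [(1, "a1"), (2, "a2"), (3, "a3")]    # joined in the child
    dim_b = [(10, "b10"), (20, "b20")]           # joined in the parent

    q = mp.Queue()
    child = mp.Process(target=child_half, args=(fact, [(0, dim_a)], q))
    child.start()
    result = parent_half(q, [(1, dim_b)])
    child.join()
    print(result)  # [(1, 10, 'a1', 'b10'), (2, 20, 'a2', 'b20')]
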
> But there's at least one other totally different way of thinking about
> this problem, which is that you might want two processes to cooperate
> in executing the SAME query node - imagine, for example, a big
> sequential scan with an expensive but highly selective filter
> condition, or an enormous sort.  You have all the same problems of
> figuring out when it's actually going to help, of course, but the
> details will likely be quite different.
>
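For that second strategy the toy version looks more like a pool of workers
each taking a slice of the same scan.  (Everything below is invented: the
predicate is just a cheap stand-in for an expensive, highly selective
filter, and the contiguous slices stand in for block ranges of the heap.)

import multiprocessing as mp

def expensive_filter(row):
    # placeholder for a genuinely costly predicate: keep rows whose cube ends in 3
    return pow(row, 3) % 10 == 3

def scan_chunk(chunk):
    """One worker scans its slice of the table and applies the filter."""
    return [row for row in chunk if expensive_filter(row)]

if __name__ == "__main__":
    table = list(range(100_000))
    nworkers = 4

    # carve the "heap" into contiguous ranges, one per worker
    step = (len(table) + nworkers - 1) // nworkers
    chunks = [table[i:i + step] for i in range(0, len(table), step)]

    with mp.Pool(nworkers) as pool:
        pieces = pool.map(scan_chunk, chunks)

    matches = [row for piece in pieces for row in piece]
    print(len(matches), "rows passed the filter")

An enormous sort could be split along the same lines, each worker sorting
its slice and the leader merging the sorted runs.
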
> I'm not really sure which one of these would be more useful in
> practice - or maybe there are even other strategies.  What does
> $COMPETITOR do?
>
> I'm also ignoring the difficulties of getting hold of a second backend
> in the right state - same database, same snapshot, etc.  It seems
> unlikely to me that this will perform acceptably for a substantial
> number of real-world applications if we have to actually start a new
> backend every time we want to parallelize a query.  IOW, we're going
> to need, well, a connection pool in core.
> *ducks, runs for cover*
>
>   
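On the startup-cost point, a toy comparison makes the argument for reuse
(pure Python, nothing like real backend startup; the 50 ms sleep just
stands in for fork, catalog caches, snapshot setup and so on):

import multiprocessing as mp
import time

def fake_backend_startup():
    time.sleep(0.05)   # stand-in for the cost of bringing up a new backend

def run_query(n):
    return n * n

def spawn_per_query(tasks):
    """Pay the startup cost once per parallelized query."""
    results = []
    for t in tasks:
        with mp.Pool(1, initializer=fake_backend_startup) as pool:
            results.append(pool.apply(run_query, (t,)))
    return results

def reuse_pool(tasks):
    """Pay the startup cost once per pooled worker, then reuse them."""
    with mp.Pool(2, initializer=fake_backend_startup) as pool:
        return pool.map(run_query, tasks)

if __name__ == "__main__":
    tasks = list(range(20))
    for fn in (spawn_per_query, reuse_pool):
        start = time.perf_counter()
        fn(tasks)
        print(fn.__name__, f"{time.perf_counter() - start:.2f}s")

The spawn-per-query version pays the startup price twenty times over; the
pool pays it twice.
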

One thing that might work quite well is slicing the work up by partition
(properly implemented partitioning would go along with this nicely too...)
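
Roughly what I have in mind, as a toy sketch (pretend the three ranges
below are the partitions of a range-partitioned table): each worker scans
and partially aggregates one partition, and the leader just combines the
partial results.

import multiprocessing as mp

def scan_partition(rows):
    """Worker: scan one partition, applying the filter and a partial count."""
    return sum(1 for r in rows if r % 7 == 0)

if __name__ == "__main__":
    # three made-up partitions, e.g. range-partitioned on the key
    partitions = [range(0, 1000), range(1000, 2000), range(2000, 3000)]

    with mp.Pool(len(partitions)) as pool:
        partial_counts = pool.map(scan_partition, partitions)

    print("total:", sum(partial_counts))  # same answer as a serial scan

The nice property is that the split points come for free from the
partition bounds, so there is nothing to decide about how to carve up
the scan.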

regards

Mark


