Re: [HACKERS] Parallel Append implementation - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [HACKERS] Parallel Append implementation
Date
Msg-id 20170407150532.6jbaqrhdnezb5tiw@alap3.anarazel.de
In response to Re: [HACKERS] Parallel Append implementation  (Amit Khandekar <amitdkhan.pg@gmail.com>)
Responses Re: [HACKERS] Parallel Append implementation  (Amit Khandekar <amitdkhan.pg@gmail.com>)
List pgsql-hackers
Hi,

On 2017-04-07 11:44:39 +0530, Amit Khandekar wrote:
> On 6 April 2017 at 07:33, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-04-05 14:52:38 +0530, Amit Khandekar wrote:
> >> This is what the earlier versions of my patch had done : just add up
> >> per-subplan parallel_workers (1 for non-partial subplan and
> >> subpath->parallel_workers for partial subplans) and set this total as
> >> the Append parallel_workers.
> >
> > I don't think that's great; consider e.g. the case that you have one
> > very expensive query and a bunch of cheaper ones. Most of those workers
> > wouldn't do much while waiting for the expensive query.  What I'm
> > basically thinking we should do is something like the following
> > pythonesque pseudocode:
> >
> > best_nonpartial_cost = -1
> > best_nonpartial_nworkers = -1
> >
> > for numworkers in 1...#max workers:
> >    worker_work = [0 for x in range(0, numworkers)]
> >
> >    nonpartial_cost = startup_cost * numworkers
> >
> >    # distribute all nonpartial tasks over workers.  Assign each task to
> >    # the worker with the least amount of work already assigned.
> >    for task in all_nonpartial_subqueries:
> >        least_busy_worker = worker_work.index_of_smallest()
> >        worker_work[least_busy_worker] += task.total_cost
> >
> >    # the nonpartial cost here is the largest amount any single worker
> >    # has to perform.
> >    nonpartial_cost += worker_work.largest()
> >
> >    total_partial_cost = 0
> >    for task in all_partial_subqueries:
> >        total_partial_cost += task.total_cost
> >
> >    # Compute resources needed by partial tasks. First compute how much
> >    # cost we can distribute to workers that take shorter than the
> >    # "busiest" worker doing non-partial tasks.
> >    remaining_avail_work = 0
> >    for i in range(0, numworkers):
> >        remaining_avail_work += worker_work.largest() - worker_work[i]
> >
> >    # Equally divide up whatever partial work doesn't fit in that slack
> >    # over all workers.
> >    if remaining_avail_work < total_partial_cost:
> >       nonpartial_cost += (total_partial_cost - remaining_avail_work) / numworkers
> >
> >    # check if this is the best number of workers
> >    if best_nonpartial_cost == -1 or nonpartial_cost < best_nonpartial_cost:
> >       best_nonpartial_cost = nonpartial_cost
> >       best_nonpartial_nworkers = numworkers
> >
> > Does that make sense?
> 
> Yeah, I gather what you are trying to achieve is: allocate a number of
> workers such that the total cost does not exceed the cost of the first
> non-partial plan (i.e. the costliest one, because the plans are sorted
> by descending cost).
> 
> So for non-partial costs such as (20, 10, 5, 2), allocate only 2
> workers, because the 2nd worker will execute (10, 5, 2) while the 1st
> worker executes (20).
> 
> But for costs such as (4, 4, 4, ... 20 times), the logic would give
> us 20 workers, because we want to finish the Append in 4 time units;
> and this is what we want to avoid when we go with the
> don't-allocate-too-many-workers approach.
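For concreteness, the heuristic sketched above can be written as runnable Python. The function name and the flat per-worker `startup_cost` are illustrative assumptions on my part; the planner's real costing is more involved, but the greedy least-busy-worker assignment is the same idea:

```python
def append_worker_cost(nonpartial_costs, partial_costs, max_workers,
                       startup_cost=1.0):
    """Try each candidate worker count: greedily assign non-partial
    subplans to the least-busy worker, spread partial work into the
    slack below the busiest worker, and return the cheapest
    (cost, nworkers) pair found."""
    best_cost, best_nworkers = None, None
    for nworkers in range(1, max_workers + 1):
        work = [0.0] * nworkers
        # Greedy assignment, visiting subplans in descending cost order
        # (as the sorted Append subplan list would).
        for c in sorted(nonpartial_costs, reverse=True):
            work[work.index(min(work))] += c
        busiest = max(work)
        cost = startup_cost * nworkers + busiest
        # Workers finishing before the busiest one can absorb partial
        # work "for free"; only the overflow extends the runtime.
        slack = sum(busiest - w for w in work)
        overflow = sum(partial_costs) - slack
        if overflow > 0:
            cost += overflow / nworkers
        if best_cost is None or cost < best_cost:
            best_cost, best_nworkers = cost, nworkers
    return best_cost, best_nworkers
```

On the examples above, `append_worker_cost([20, 10, 5, 2], [], 8)` settles on 2 workers, matching the hand analysis; and notably, even for the (4, 4, 4, ... 20 times) case the modelled startup cost already tempers the answer, picking 10 workers rather than 20 with `startup_cost=1.0`.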

I guess my problem is that I don't agree with that as a goal in and of
itself.  If you actually want to run your query quickly, you *want* 20
workers here.  The issues of backend startup overhead (already modelled
via parallel_setup_cost), concurrency and the like should be modelled,
not buried in a formula the user can't change.  If we want to make it
less and less likely to start more workers, we should make that
configurable, not the default.

I think there's some precedent taken from the parallel seqscan case
that's not actually applicable here.  Parallel seqscans have a good
amount of shared state, both on the kernel and pg side, and that shared
state reduces the gains of increasing the number of workers.  But with
non-partial scans such shared state largely doesn't exist.


> > especially if we get partitionwise joins.
> 
> About that I am not sure, because we already have support for parallel
> joins, so wouldn't the join subpaths corresponding to all of the
> partitions be partial paths? I may be wrong about that.

We'll probably generate both, and then choose the cheaper one.  The
startup cost for partitionwise joins should usually be a lot lower
(because e.g. for hash joins we'll build smaller hashtables), and we
should have less cost of concurrency.


> But if the subplans are foreign scans, then yes, all would be
> non-partial plans. This may provoke off-topic discussion, but here,
> instead of assigning so many workers to all these foreign plans and
> having all those workers wait for the results, a single asynchronous
> execution node (which is still in the making) would be desirable,
> because it would do the job of all these workers.

That's something that probably shouldn't be modelled through a parallel
append, I agree - it shouldn't be too hard to develop a costing model
for that, however.

Greetings,

Andres Freund


