Re: [HACKERS] Parallel Append implementation - Mailing list pgsql-hackers
From | Tels |
---|---|
Subject | Re: [HACKERS] Parallel Append implementation |
Date | |
Msg-id | 68a7cc5f4d7c3b68f82bac5a95d97a3f.squirrel@sm.webmail.pair.com |
In response to | Re: [HACKERS] Parallel Append implementation (Amit Khandekar <amitdkhan.pg@gmail.com>) |
Responses | Re: [HACKERS] Parallel Append implementation |
List | pgsql-hackers |
Moin,

On Fri, March 10, 2017 3:24 am, Amit Khandekar wrote:
> On 10 March 2017 at 12:33, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> On Fri, Mar 10, 2017 at 11:33 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>>>
>>>> But as far as code is concerned, I think the two-list approach will
>>>> turn out to be less simple if we derive corresponding two different
>>>> arrays in the AppendState node. Handling two different arrays during
>>>> execution does not look clean. Whereas, the bitmapset that I have used
>>>> in Append has turned out to be very simple. I just had to do the below
>>>> check (and that is the only location) to see if it's a partial or
>>>> non-partial subplan. There is nowhere else any special handling for a
>>>> non-partial subpath.
>>>>
>>>> /*
>>>>  * Increment worker count for the chosen node, if at all we found one.
>>>>  * For non-partial plans, set it to -1 instead, so that no other
>>>>  * workers run it.
>>>>  */
>>>> if (min_whichplan != PA_INVALID_PLAN)
>>>> {
>>>>     if (bms_is_member(min_whichplan,
>>>>                       ((Append *) state->ps.plan)->partial_subplans_set))
>>>>         padesc->pa_info[min_whichplan].pa_num_workers++;
>>>>     else
>>>>         padesc->pa_info[min_whichplan].pa_num_workers = -1;
>>>> }
>>>>
>>>> Now, since the Bitmapset field is used during execution with such
>>>> simplicity, why not have this same data structure in AppendPath, and
>>>> re-use the bitmapset field in the Append plan node without making a
>>>> copy of it. Otherwise, if we have two lists in AppendPath, and a
>>>> bitmap in Append, again there is going to be code for data structure
>>>> conversion.
>>>>
>>>
>>> I think there is some merit in separating out non-parallel and
>>> parallel plans, within the same array or outside it. The current logic
>>> to assign a plan to a worker looks at all the plans, unnecessarily
>>> hopping over the non-parallel ones after they have been given to a
>>> worker. If we separate those two, we can keep assigning new workers to
>>> the non-parallel plans first and then iterate over the parallel ones
>>> when a worker needs a plan to execute. We might eliminate the need for
>>> the special value -1 for the number of workers. You may separate those
>>> two kinds in two different arrays, or within the same array and
>>> remember the smallest index of a parallel plan.
>
> Do you think we might get a performance benefit with this? I am looking
> more towards logic simplicity. Non-parallel plans would most likely be
> there only in case of UNION ALL queries, and not partitioned tables.
> And UNION ALL queries probably would have a far smaller number of
> subplans, so there won't be too many unnecessary hops. The need for
> num_workers=-1 will still be there for partial plans, because we need
> to set it to -1 once a worker finishes a plan.
>
>> Further to that, with this scheme and the scheme to distribute workers
>> equally irrespective of the maximum workers per plan, you don't need
>> to "scan" the subplans to find the one with minimum workers. If you
>> treat the array of parallel plans as a circular queue, the plan to be
>> assigned next to a worker will always be the plan next to the one
>> which got assigned to the given worker.
>
>> Once you have assigned workers to non-parallel plans, initialize a
>> shared variable next_plan to point to the first parallel plan. When a
>> worker comes asking for a plan, assign the plan pointed to by
>> next_plan and update it to the next plan in the circular queue.
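For concreteness, here is a minimal, self-contained sketch of the circular-queue idea quoted above. It is not code from the patch: the struct and field names (ParallelAppendDesc, pa_next_plan, pa_finished) are invented for illustration, locking of the shared state is omitted, and the pa_finished check already reflects the caveat raised in the reply just below, namely that a worker has to skip subplans that are already exhausted instead of blindly taking next_plan.

```c
/*
 * Hypothetical sketch of the circular-queue assignment idea.
 * Names are invented; real code would protect the shared state
 * with the Append node's mutex or use atomics.
 */
#include <stdbool.h>
#include <stdio.h>

#define NUM_PARALLEL_PLANS 3

typedef struct ParallelAppendDesc
{
    int  pa_next_plan;                      /* next subplan to hand out */
    bool pa_finished[NUM_PARALLEL_PLANS];   /* set once a subplan is exhausted */
} ParallelAppendDesc;

/*
 * Hand out the next unfinished partial subplan, advancing the shared
 * cursor; returns -1 when every subplan is done.
 */
static int
choose_next_subplan(ParallelAppendDesc *padesc)
{
    for (int tries = 0; tries < NUM_PARALLEL_PLANS; tries++)
    {
        int plan = padesc->pa_next_plan;

        padesc->pa_next_plan = (plan + 1) % NUM_PARALLEL_PLANS;
        if (!padesc->pa_finished[plan])
            return plan;
    }
    return -1;
}

int
main(void)
{
    ParallelAppendDesc desc = { .pa_next_plan = 0 };

    desc.pa_finished[1] = true;     /* pretend subplan 1 already finished */

    for (int worker = 0; worker < 4; worker++)
        printf("worker %d gets subplan %d\n",
               worker, choose_next_subplan(&desc));
    return 0;
}
```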
>
> At some point, this logic may stop working. Imagine plans are running
> with (1, 1, 1). The next worker goes to plan 1, so they run with
> (2, 1, 1). So now next_plan points to plan 2. Now suppose the worker
> on plan 2 finishes. It should not again take plan 2, even though
> next_plan points to 2. It should take plan 3, or whichever is not
> finished. Maybe a worker that finishes a plan should do this check
> before going directly to next_plan. But if this turns out to be as
> simple as the find-min-worker-plan logic, we can use it. We will have
> to check, though. We can anyway consider this even when we have a
> single list.

Just a question for me to understand the implementation details vs. the
strategy: have you considered how the scheduling decision might impact
performance due to "inter-plan parallelism vs. in-plan parallelism"?

So what would be the scheduling strategy? Should there be a fixed one,
or should it be user-influenceable? And what would good ones look like?

A simple example: say we have 5 subplans, each can have at most 5
workers, and we have 5 workers overall. Do we:

 Assign 5 workers to plan 1. Let it finish.
 Then assign 5 workers to plan 2. Let it finish.
 And so on.

or:

 Assign 1 worker to each plan until no workers are left?

In the second case you would have 5 plans running in a quasi-sequential
manner, which might be slower than the other way. Or not; that probably
needs some benchmarks.

Likewise, if you have a mix of plans with max workers like:

 Plan A: 1 worker
 Plan B: 2 workers
 Plan C: 3 workers
 Plan D: 1 worker
 Plan E: 4 workers

would the strategy be:

* Serve them in first-come-first-served order? (A,B,C,D?) (Would the
  order here be random, depending on how the plans emerge, i.e. could
  the user re-order the query to get a different order?)
* Serve them in max-workers order? (A,D,B,C)
* Serve first all with 1 worker, then fill the rest? (A,D,B,C | A,D,C,B)
* Serve them by some other metric, e.g. index-only scans first,
  seq-scans last?

Or a mix of all these? Excuse me if I just didn't see this discussed in
the thread so far. :)

Best regards,

Tels
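To make the difference between the first two strategies in the mail above concrete, here is a toy, self-contained sketch. It is not taken from the patch; the function names and the fixed plan/worker numbers are assumptions based on the A..E example in the mail. It contrasts filling one subplan up to its worker limit before moving on ("depth-first") with handing every subplan one worker before adding a second anywhere ("breadth-first"); with 5 workers and per-plan limits (1, 2, 3, 1, 4) the two policies end up with (1, 2, 2, 0, 0) and (1, 1, 1, 1, 1) respectively.

```c
/*
 * Toy comparison of two worker-assignment strategies for Parallel Append.
 * All names and numbers are illustrative only, not from the patch.
 */
#include <stdbool.h>
#include <stdio.h>

#define NPLANS   5
#define NWORKERS 5

/* Fill each subplan to its limit before moving to the next one. */
static void
assign_depth_first(const int max_w[NPLANS], int assigned[NPLANS])
{
    int left = NWORKERS;

    for (int p = 0; p < NPLANS && left > 0; p++)
        while (assigned[p] < max_w[p] && left > 0)
        {
            assigned[p]++;
            left--;
        }
}

/* Give every subplan one worker before adding a second one anywhere. */
static void
assign_breadth_first(const int max_w[NPLANS], int assigned[NPLANS])
{
    int  left = NWORKERS;
    bool progress = true;

    while (left > 0 && progress)
    {
        progress = false;
        for (int p = 0; p < NPLANS && left > 0; p++)
            if (assigned[p] < max_w[p])
            {
                assigned[p]++;
                left--;
                progress = true;
            }
    }
}

int
main(void)
{
    /* Plans A..E with max workers 1, 2, 3, 1, 4 as in the example above. */
    const int max_w[NPLANS] = {1, 2, 3, 1, 4};
    int       depth[NPLANS] = {0};
    int       breadth[NPLANS] = {0};

    assign_depth_first(max_w, depth);
    assign_breadth_first(max_w, breadth);

    for (int p = 0; p < NPLANS; p++)
        printf("plan %c: depth-first=%d breadth-first=%d\n",
               'A' + p, depth[p], breadth[p]);
    return 0;
}
```

Which of the two (or which mix) wins presumably depends on how well each subplan scales with extra workers, which is why the mail suggests benchmarking rather than assuming one policy.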