Re: [HACKERS] Parallel Append implementation - Mailing list pgsql-hackers

From: Tels
Subject: Re: [HACKERS] Parallel Append implementation
Msg-id: 68a7cc5f4d7c3b68f82bac5a95d97a3f.squirrel@sm.webmail.pair.com
In response to: Re: [HACKERS] Parallel Append implementation (Amit Khandekar <amitdkhan.pg@gmail.com>)
Responses: Re: [HACKERS] Parallel Append implementation (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Moin,

On Fri, March 10, 2017 3:24 am, Amit Khandekar wrote:
> On 10 March 2017 at 12:33, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> On Fri, Mar 10, 2017 at 11:33 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>>>
>>>> But as far as the code is concerned, I think the two-list approach
>>>> will turn out to be less simple if we derive two corresponding arrays
>>>> in the AppendState node. Handling two different arrays during
>>>> execution does not look clean, whereas the bitmapset that I have used
>>>> in Append has turned out to be very simple. I just had to do the
>>>> below check (and that is the only location) to see whether it's a
>>>> partial or non-partial subplan. There is no special handling for
>>>> non-partial subpaths anywhere else.
>>>>
>>>> /*
>>>>  * Increment worker count for the chosen node, if at all we found one.
>>>>  * For non-partial plans, set it to -1 instead, so that no other
>>>>  * workers run it.
>>>>  */
>>>> if (min_whichplan != PA_INVALID_PLAN)
>>>> {
>>>>     if (bms_is_member(min_whichplan,
>>>>                       ((Append *) state->ps.plan)->partial_subplans_set))
>>>>         padesc->pa_info[min_whichplan].pa_num_workers++;
>>>>     else
>>>>         padesc->pa_info[min_whichplan].pa_num_workers = -1;
>>>> }
>>>>
>>>> Now, since the Bitmapset field is used during execution with such
>>>> simplicity, why not have this same data structure in AppendPath, and
>>>> re-use the bitmapset field in the Append plan node without making a
>>>> copy of it? Otherwise, if we have two lists in AppendPath and a
>>>> bitmap in Append, there is again going to be code for data structure
>>>> conversion.
>>>>
>>>
>>> I think there is some merit in separating out non-parallel and
>>> parallel plans, whether within the same array or outside it. The
>>> current logic to assign a plan to a worker looks at all the plans,
>>> unnecessarily hopping over the non-parallel ones after they have been
>>> given to a worker. If we separate those two, we can keep assigning new
>>> workers to the non-parallel plans first and then iterate over the
>>> parallel ones when a worker needs a plan to execute. We might
>>> eliminate the need for the special value -1 for num_workers. You may
>>> separate those two kinds into two different arrays, or keep them in
>>> the same array and remember the smallest index of a parallel plan.
>
> Do you think we might get a performance benefit with this? I am looking
> more towards logic simplicity. Non-parallel plans would most likely be
> there only in the case of UNION ALL queries, not partitioned tables. And
> UNION ALL queries would probably have a far smaller number of subplans,
> so there won't be too many unnecessary hops. The need for num_workers=-1
> will still be there for partial plans, because we need to set it to -1
> once a worker finishes a plan.
>
>>
>> Further to that, with this scheme and the scheme to distribute workers
>> equally irrespective of the maximum workers per plan, you don't need
>> to "scan" the subplans to find the one with minimum workers. If you
>> treat the array of parallel plans as a circular queue, the plan to be
>> assigned next to a worker will always be the plan next to the one
>> which got assigned to the given worker.
>
>> Once you have assigned workers
>> to non-parallel plans, initialize a shared variable next_plan to point
>> to the first parallel plan. When a worker comes asking for a plan,
>> assign the plan pointed to by next_plan and update it to the next plan
>> in the circular queue.
>
> At some point, this logic may stop working. Imagine plans running with
> (1, 1, 1). The next worker goes to plan 1, so they run with (2, 1, 1),
> and now next_plan points to plan 2. Now suppose the worker on plan 2
> finishes. It should not take plan 2 again, even though next_plan points
> to 2. It should take plan 3, or whichever one is not finished. Maybe a
> worker that finishes a plan should do this check before going directly
> to next_plan. If this turns out to be as simple as finding the
> minimum-worker plan, we can use this logic, but that will have to be
> checked. We can anyway consider this even when we have a single list.
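
To check that I follow the scheme being discussed, here is a rough sketch
of how the plan-choosing step could look. None of these names are from the
actual patch (ParallelAppendDescSketch, first_partial_plan, next_plan and
pa_finished are made up here); it assumes the subplans are ordered with the
non-partial ones first and that the function runs under the shared
descriptor's lock:

#include <stdbool.h>

/* All names below are made up for this sketch; not from the patch. */
typedef struct ParallelAppendDescSketch
{
    int     nplans;             /* total number of subplans */
    int     first_partial_plan; /* subplans before this index are non-partial */
    int     next_nonpartial;    /* next non-partial plan not yet claimed */
    int     next_plan;          /* circular cursor over the partial plans,
                                 * starts at first_partial_plan */
    bool   *pa_finished;        /* per-subplan "no more work here" flags */
} ParallelAppendDescSketch;

static int
choose_next_subplan(ParallelAppendDescSketch *padesc)
{
    int     examined;

    /* Hand each non-partial plan to exactly one worker first. */
    while (padesc->next_nonpartial < padesc->first_partial_plan)
    {
        int     plan = padesc->next_nonpartial++;

        if (!padesc->pa_finished[plan])
            return plan;
    }

    /* No partial plans at all (pure UNION ALL of non-partial paths)? */
    if (padesc->first_partial_plan >= padesc->nplans)
        return -1;

    /*
     * Walk the partial plans as a circular queue, skipping plans already
     * marked finished (the check Amit mentions above), so that a worker
     * does not re-take the plan it just completed.
     */
    for (examined = 0;
         examined < padesc->nplans - padesc->first_partial_plan;
         examined++)
    {
        int     plan = padesc->next_plan;

        padesc->next_plan++;
        if (padesc->next_plan >= padesc->nplans)
            padesc->next_plan = padesc->first_partial_plan;

        if (!padesc->pa_finished[plan])
            return plan;
    }

    return -1;                  /* every subplan is done */
}

(With something like this, the -1 sentinel indeed seems unnecessary for the
non-partial plans, since each of them is handed out exactly once; for the
partial plans the pa_finished flag would play that role.)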

Just a question for me to understand the implementation details vs. the
strategy:

Have you considered how the scheduling decision might impact performance
due to "inter-plan parallelism vs. in-plan parallelism"?

So what would be the scheduling strategy? And should there be a fixed one,
or should it be user-influenceable? And what could be good ones?

A simple example:

E.g. if we have 5 subplans, each of which can use at most 5 workers, and we
have 5 workers overall.

So, do we:
 Assign 5 workers to plan 1. Let it finish. Then assign 5 workers to plan 2.
 Let it finish. And so on?

or:
 Assign 1 worker to each plan until no workers are left?

In the second case you would have 5 plans running in a quasi-sequential
manner, which might be slower than the other way. Or not; that probably
needs some benchmarks?
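
To make the two extremes concrete, here is a toy standalone program
(nothing to do with the actual patch) that distributes the 5 workers either
depth-first or breadth-first over the 5 subplans:

#include <stdbool.h>
#include <stdio.h>

#define NPLANS   5
#define NWORKERS 5

/* Toy illustration only; not planner/executor code. */

/* Fill plan 0 up to its limit, then plan 1, and so on ("let it finish"). */
static void
assign_depth_first(const int max_workers[NPLANS], int assigned[NPLANS])
{
    int     remaining = NWORKERS;

    for (int i = 0; i < NPLANS && remaining > 0; i++)
    {
        assigned[i] = (remaining < max_workers[i]) ? remaining : max_workers[i];
        remaining -= assigned[i];
    }
}

/* Hand out one worker per plan per round until no workers are left. */
static void
assign_breadth_first(const int max_workers[NPLANS], int assigned[NPLANS])
{
    int     remaining = NWORKERS;
    bool    progressed = true;

    while (remaining > 0 && progressed)
    {
        progressed = false;
        for (int i = 0; i < NPLANS && remaining > 0; i++)
        {
            if (assigned[i] < max_workers[i])
            {
                assigned[i]++;
                remaining--;
                progressed = true;
            }
        }
    }
}

int
main(void)
{
    int     max_workers[NPLANS] = {5, 5, 5, 5, 5};
    int     depth[NPLANS] = {0};
    int     breadth[NPLANS] = {0};

    assign_depth_first(max_workers, depth);     /* gives 5 0 0 0 0 */
    assign_breadth_first(max_workers, breadth); /* gives 1 1 1 1 1 */

    for (int i = 0; i < NPLANS; i++)
        printf("plan %d: depth-first %d worker(s), breadth-first %d worker(s)\n",
               i, depth[i], breadth[i]);
    return 0;
}

In the first case the workers gang up on one subplan at a time; in the
second the five subplans run side by side with one worker each, which is
the quasi-sequential case I mean above.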

Likewise, if you have a mix of plans with max workers like:
 Plan A: 1 worker
 Plan B: 2 workers
 Plan C: 3 workers
 Plan D: 1 worker
 Plan E: 4 workers

Would the strategy be:

* Serve them in first-come-first-served order? (A,B,C,D?) (Would the order
  here be random due to how the plans emerge, i.e. could the user re-order
  the query to get a different order?)
* Serve them in max-workers order? (A,D,B,C)
* Serve first all plans with 1 worker, then fill the rest?
  (A,D,B,C | A,D,C,B) (a toy sort-based sketch follows below)
* Serve them by some other metric, e.g. index-only scans first, seq-scans
  last? Or a mix of all these?
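
For the sort-based variants I imagine it mostly boils down to ordering the
subplan array by some key before the workers start picking. Again only a
toy sketch; SubplanInfo and max_workers are made-up names, not from the
patch:

#include <stdlib.h>

typedef struct SubplanInfo
{
    const char *name;           /* "A" .. "E" in the example above */
    int         max_workers;    /* how many workers the subplan can use */
} SubplanInfo;

/* Ascending by max_workers: plans needing only 1 worker are served first. */
static int
cmp_by_max_workers(const void *a, const void *b)
{
    const SubplanInfo *pa = (const SubplanInfo *) a;
    const SubplanInfo *pb = (const SubplanInfo *) b;

    return pa->max_workers - pb->max_workers;
}

/*
 * Other metrics (index-only scans first, estimated cost, ...) would just
 * be a different comparison function here:
 *
 *     qsort(subplans, nplans, sizeof(SubplanInfo), cmp_by_max_workers);
 */

The qsort() here only stands in for however the planner would actually
order the subplans.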

Excuse me if I just didn't see this from the thread so far. :)

Best regards,

Tels


