Re: [DESIGN] ParallelAppend - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [DESIGN] ParallelAppend
Msg-id CAA4eK1LgUxjRbi-CbhpiXE_NMJhup9JVEw=HMp87wfL9EdLUMg@mail.gmail.com
In response to Re: [DESIGN] ParallelAppend  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [DESIGN] ParallelAppend  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Sat, Nov 14, 2015 at 3:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > I'm now designing the parallel feature of Append...
> >
> > Here is one challenge. How do we determine whether each sub-plan
> > allows execution in the background worker context?
>
> I've been thinking about these questions for a bit now, and I think we
> can work on improving Append in multiple phases.  The attached patch
> shows what I have in mind for phase 1.
>

Couple of comments and questions regarding this patch:

1.
+/*
+ * add_partial_path
..
+ *  produce the same number of rows.  Neither do we need to consider startup
+ *  costs: parallelism is only used for plans that will be run to completion.

A.
Don't we need the startup cost in case we need to build partial paths for
join paths like mergepath?
Also, I think there are other cases for a single-relation scan where startup
cost can matter, such as when there are pseudoconstants in the qualification
(refer to cost_qual_eval_walker()) or, let us say, if someone has disabled
seq scan (disable_cost is counted as startup cost).
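
To make the concern concrete, here is a toy sketch. It is not PostgreSQL
code: ToyPath and both helper functions are hypothetical names used only for
illustration. It compares a rule that looks at total cost alone with one that
also looks at startup cost. Under the total-cost-only rule, a path that is
much cheaper to start but slightly more expensive overall is discarded.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a planner path (not the real Path struct). */
typedef struct ToyPath
{
    double startup_cost;    /* cost before the first row can be returned */
    double total_cost;      /* cost to return all rows */
} ToyPath;

/* Total-cost-only rule: the new path survives only if it is cheaper overall. */
static bool
keep_by_total_only(const ToyPath *old_path, const ToyPath *new_path)
{
    return new_path->total_cost < old_path->total_cost;
}

/* Rule that also considers startup cost: the new path survives if it wins on either. */
static bool
keep_by_startup_or_total(const ToyPath *old_path, const ToyPath *new_path)
{
    return new_path->startup_cost < old_path->startup_cost ||
           new_path->total_cost < old_path->total_cost;
}
```

With an existing path of (startup 100, total 150) and a new path of
(startup 1, total 160), the first rule rejects the new path and the second
keeps it, which is the case a merge join above it might care about.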

B. I think partial path is an important concept and deserves some
explanation in src/backend/optimizer/README.
There is already a good explanation of Paths there, so I think it
would be better to add an explanation of partial paths as well.

2.
+ *  costs: parallelism is only used for plans that will be run to completion.
+ *    Therefore, this routine is much simpler than add_path: it needs to
+ *    consider only pathkeys and total cost.

There seems to be a spacing issue in the last two lines.

3.
+static void
+create_parallel_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+ int parallel_threshold = 1000;
+ int parallel_degree = 1;
+
+ /*
+ * If this relation is too small to be worth a parallel scan, just return
+ * without doing anything ... unless it's an inheritance child.  In that case,
+ * we want to generate a parallel path here anyway.  It might not be worthwhile
+ * just for this relation, but when combined with all of its inheritance siblings
+ * it may well pay off.
+ */
+ if (rel->pages < parallel_threshold && rel->reloptkind == RELOPT_BASEREL)
+ return;

A.
This means that for inheritance child relations whose rel pages are
fewer than parallel_threshold, it will always consider the cost as shared
between 1 worker and the leader, as per the calculation below in cost_seqscan:
if (path->parallel_degree > 0)
run_cost = run_cost / (path->parallel_degree + 0.5);

I think this might not be an appropriate cost model even for
non-inheritance relations that have more pages than parallel_threshold,
but it seems to be even worse for inheritance children that have
fewer pages than parallel_threshold.
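
As a small worked example of the cost_seqscan calculation quoted above, the
sketch below isolates just that division. The function name
parallel_run_cost is my own and does not appear in the patch, and the input
numbers are made up. With parallel_degree = 1 the run cost is always divided
by 1.5, however few pages the child relation has.

```c
/*
 * Hypothetical helper that isolates the division quoted from cost_seqscan.
 * The name and signature are assumptions made for illustration only.
 */
static double
parallel_run_cost(double run_cost, int parallel_degree)
{
    if (parallel_degree > 0)
        run_cost = run_cost / (parallel_degree + 0.5);
    return run_cost;
}
```

For instance, a run cost of 150 becomes 100 at parallel_degree = 1 and 60 at
parallel_degree = 2, while parallel_degree = 0 leaves it unchanged.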

B.
If none of the inheritance child rels (or very few of them) are big
enough for a parallel scan, is considering the Append node for
parallelism of any use? In other words, won't it be better to generate
the plan as is done now, without this patch, for such cases,
considering the current execution model of the Gather node?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
