Thread: Observations in Parallel Append

Observations in Parallel Append

From: Amit Kapila
A few observations on the Parallel Append commit (ab727167):

1.
+++ b/src/include/nodes/execnodes.h
@@ -21,6 +21,7 @@
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
+#include "storage/spin.h"
..

There doesn't seem to be any need for including spin.h; I think some
prior version of the patch might have needed it.  Patch attached to
remove it.
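
The removal itself is trivial; presumably the attached patch is just
the reverse of the hunk quoted above:

--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -21,7 +21,6 @@
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
-#include "storage/spin.h"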

2.
+ *     choose_next_subplan_for_worker
..
+ *     We start from the first plan and advance through the list;
+ *     when we get back to the end, we loop back to the first
+ *     nonpartial plan.
..
+choose_next_subplan_for_worker(AppendState *node)
{
..
+       if (pstate->pa_next_plan < node->as_nplans - 1)
+       {
+           /* Advance to next plan. */
+           pstate->pa_next_plan++;
+       }
+       else if (append->first_partial_plan < node->as_nplans)
+       {
+           /* Loop back to first partial plan. */
+           pstate->pa_next_plan = append->first_partial_plan;
+       }
..

The code and comment don't seem to match.  The comment indicates that
after reaching the end of the list, we loop back to the first
nonpartial plan, whereas the code indicates that we loop back to the
first partial plan.  I think one of those needs to be changed unless I
am missing something obvious.

3.
+cost_append(AppendPath *apath)
{
..
+           /*
+            * Apply parallel divisor to non-partial subpaths.  Also add the
+            * cost of partial paths to the total cost, but ignore non-partial
+            * paths for now.
+            */
+           if (i < apath->first_partial_path)
+               apath->path.rows += subpath->rows / parallel_divisor;
+           else
+           {
+               apath->path.rows += subpath->rows;
+               apath->path.total_cost += subpath->total_cost;
+           }
..
}

I think it is better to apply clamp_row_est to rows in the case where
we use the parallel_divisor, so that the value of rows is always sane.
Also, don't we need to use the parallel_divisor for partial paths
instead of non-partial paths, since those are the ones actually
distributed among workers?
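
For the first point, one way to apply the suggestion in cost_append()
would be along these lines (a sketch; clamp_row_est() is the existing
costsize.c helper that rounds a row estimate and clamps it to at least
one):

+           if (i < apath->first_partial_path)
+               apath->path.rows +=
+                   clamp_row_est(subpath->rows / parallel_divisor);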

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Observations in Parallel Append

From: Robert Haas
On Fri, Dec 22, 2017 at 6:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> There doesn't seem to be any need for including spin.h; I think some
> prior version of the patch might have needed it.  Patch attached to
> remove it.

OK, good catch.

> The code and comment don't seem to match.  The comment indicates that
> after reaching the end of the list, we loop back to the first
> nonpartial plan, whereas the code indicates that we loop back to the
> first partial plan.  I think one of those needs to be changed unless I
> am missing something obvious.

Yeah, the header comment should say partial, not nonpartial.
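
That is, the looping sentence in the header comment would become:

+ *     We start from the first plan and advance through the list;
+ *     when we get back to the end, we loop back to the first
+ *     partial plan.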

> 3.
> +cost_append(AppendPath *apath)
> {
> ..
> +           /*
> +            * Apply parallel divisor to non-partial subpaths.  Also add the
> +            * cost of partial paths to the total cost, but ignore non-partial
> +            * paths for now.
> +            */
> +           if (i < apath->first_partial_path)
> +               apath->path.rows += subpath->rows / parallel_divisor;
> +           else
> +           {
> +               apath->path.rows += subpath->rows;
> +               apath->path.total_cost += subpath->total_cost;
> +           }
> ..
> }
>
> I think it is better to apply clamp_row_est to rows in the case where
> we use the parallel_divisor, so that the value of rows is always sane.

Good point.

> Also, don't we need to use the parallel_divisor for partial paths
> instead of non-partial paths, since those are the ones actually
> distributed among workers?

Uh, that seems backwards to me.  We're trying to estimate the average
number of rows per worker.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Observations in Parallel Append

From: Amit Kapila
On Sun, Dec 24, 2017 at 12:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 22, 2017 at 6:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>> Also, don't we need to use the parallel_divisor for partial paths
>> instead of non-partial paths, since those are the ones actually
>> distributed among workers?
>
> Uh, that seems backwards to me.  We're trying to estimate the average
> number of rows per worker.
>

Okay, but is it appropriate to use the parallel_divisor?  The
parallel_divisor represents the contribution of all the workers (plus
the leader's contribution), whereas for non-partial paths there will
always be only a subset of the workers operating on them.  Consider a
case with one non-partial subpath and five partial subpaths, with six
as the parallel_divisor: the current code will divide the rows of the
non-partial subpath across six workers, whereas in reality exactly one
worker will execute that path.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Observations in Parallel Append

From: Robert Haas
On Sun, Dec 24, 2017 at 8:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Dec 24, 2017 at 12:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Dec 22, 2017 at 6:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>> Also, don't we need to use the parallel_divisor for partial paths
>>> instead of non-partial paths, since those are the ones actually
>>> distributed among workers?
>>
>> Uh, that seems backwards to me.  We're trying to estimate the average
>> number of rows per worker.
>
> Okay, but is it appropriate to use the parallel_divisor?  The
> parallel_divisor represents the contribution of all the workers (plus
> the leader's contribution), whereas for non-partial paths there will
> always be only a subset of the workers operating on them.  Consider a
> case with one non-partial subpath and five partial subpaths, with six
> as the parallel_divisor: the current code will divide the rows of the
> non-partial subpath across six workers, whereas in reality exactly one
> worker will execute that path.

That's true, of course, but if five processes each return 0 rows and
the sixth process returns 600 rows, the average number of rows per
process is 100, not anything else.

Here's one way to look at it.  Suppose there is a table with 1000
partitions.  If we do a Parallel Append over a Parallel Seq Scan per
partition, we will come up with a row estimate by summing the
estimated row count across all partitions and dividing by the
parallel_divisor.  This will give us some answer.  If we instead do a
Parallel Append over a Seq Scan per partition, we should really come
up with the *same* estimate.  The only way to do that is to also
divide by the parallel_divisor in this case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Observations in Parallel Append

From: Amit Kapila
On Wed, Dec 27, 2017 at 12:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Dec 24, 2017 at 8:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Okay, but is it appropriate to use the parallel_divisor?  The
>> parallel_divisor represents the contribution of all the workers (plus
>> the leader's contribution), whereas for non-partial paths there will
>> always be only a subset of the workers operating on them.  Consider a
>> case with one non-partial subpath and five partial subpaths, with six
>> as the parallel_divisor: the current code will divide the rows of the
>> non-partial subpath across six workers, whereas in reality exactly one
>> worker will execute that path.
>
> That's true, of course, but if five processes each return 0 rows and
> the sixth process returns 600 rows, the average number of rows per
> process is 100, not anything else.
>
> Here's one way to look at it.  Suppose there is a table with 1000
> partitions.  If we do a Parallel Append over a Parallel Seq Scan per
> partition, we will come up with a row estimate by summing the
> estimated row count across all partitions and dividing by the
> parallel_divisor.  This will give us some answer.  If we instead do a
> Parallel Append over a Seq Scan per partition, we should really come
> up with the *same* estimate.  The only way to do that is to also
> divide by the parallel_divisor in this case.
>

The theory sounds good, but it looks like the code doesn't behave that
way.  To verify the theory, my colleague Dilip and I ran a few tests,
the output of which is below:

Test setup
-------------------
create table t(a int, b varchar) partition by range(a);
create table t1 partition of t for values from (0) to (100000);
create table t2 partition of t for values from (100000) to (200000);
create table t3 partition of t for values from (200000) to (300000);
create table t4 partition of t for values from (300000) to (400000);
insert into t values (generate_series(1,399999), repeat('x', 50));
analyze t;

Test
-------
set parallel_tuple_cost=0;
set parallel_setup_cost=0;
set max_parallel_workers_per_gather=6;

postgres=# explain select * from t;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather  (cost=0.00..6476.94 rows=399999 width=55)
   Workers Planned: 3
   ->  Parallel Append  (cost=0.00..6476.94 rows=235295 width=55)
         ->  Parallel Seq Scan on t1  (cost=0.00..1619.23 rows=58823 width=55)
         ->  Parallel Seq Scan on t2  (cost=0.00..1619.24 rows=58824 width=55)
         ->  Parallel Seq Scan on t3  (cost=0.00..1619.24 rows=58824 width=55)
         ->  Parallel Seq Scan on t4  (cost=0.00..1619.24 rows=58824 width=55)
(7 rows)


postgres=# alter table t1 set (parallel_workers=0);
ALTER TABLE
postgres=# explain select * from t;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather  (cost=0.00..6888.70 rows=399999 width=55)
   Workers Planned: 3
   ->  Parallel Append  (cost=0.00..6888.70 rows=208730 width=55)
         ->  Seq Scan on t1  (cost=0.00..2030.99 rows=99999 width=55)
         ->  Parallel Seq Scan on t2  (cost=0.00..1619.24 rows=58824 width=55)
         ->  Parallel Seq Scan on t3  (cost=0.00..1619.24 rows=58824 width=55)
         ->  Parallel Seq Scan on t4  (cost=0.00..1619.24 rows=58824 width=55)
(7 rows)


postgres=# alter table t2 set (parallel_workers=0);
ALTER TABLE
postgres=# alter table t3 set (parallel_workers=0);
ALTER TABLE
postgres=# alter table t4 set (parallel_workers=0);
ALTER TABLE
postgres=# explain select * from t;
                              QUERY PLAN
-----------------------------------------------------------------------
 Gather  (cost=0.00..4061.99 rows=399999 width=55)
   Workers Planned: 3
   ->  Parallel Append  (cost=0.00..4061.99 rows=129032 width=55)
         ->  Seq Scan on t2  (cost=0.00..2031.00 rows=100000 width=55)
         ->  Seq Scan on t3  (cost=0.00..2031.00 rows=100000 width=55)
         ->  Seq Scan on t4  (cost=0.00..2031.00 rows=100000 width=55)
         ->  Seq Scan on t1  (cost=0.00..2030.99 rows=99999 width=55)
(7 rows)


The value of rows in the Parallel Append node differs across the cases:
(a) Row estimate for Parallel Append over a Parallel Seq Scan per
partition - 235295
(b) Row estimate for Parallel Append over a mix of Parallel Seq Scan
and Seq Scan per partition - 208730
(c) Row estimate for Parallel Append over a Seq Scan per partition - 129032

I think the reason for the different estimates is that partial paths
have already been divided by their own parallel divisor, which can
differ from the Append's parallel divisor.  If the row estimation for
partial paths is scaled back by the subpath's parallel divisor before
applying the Append's, we get consistent results.  I have tried the
below change:

+ apath->path.rows += (subpath->rows * subpath_parallel_divisor /
+ parallel_divisor);
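
In context, the row accumulation in cost_append() would then look
roughly like this (a sketch; subpath_parallel_divisor is presumably
obtained from the subpath's own parallel state via the existing
costsize.c helper get_parallel_divisor()):

            if (i < apath->first_partial_path)
            {
                /*
                 * Non-partial path: all rows go to one process; divide by
                 * the Append's divisor to get the per-process average.
                 */
                apath->path.rows += subpath->rows / parallel_divisor;
            }
            else
            {
                double  subpath_parallel_divisor;

                /*
                 * Partial path: rows were already divided by the subpath's
                 * own divisor, so scale back up before applying the
                 * Append's divisor.
                 */
                subpath_parallel_divisor = get_parallel_divisor(subpath);
                apath->path.rows += subpath->rows *
                    subpath_parallel_divisor / parallel_divisor;
                apath->path.total_cost += subpath->total_cost;
            }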

The results after the change:
postgres=# explain select * from t;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather  (cost=0.00..6476.94 rows=399999 width=55)
   Workers Planned: 3
   ->  Parallel Append  (cost=0.00..6476.94 rows=129032 width=55)
         ->  Parallel Seq Scan on t1  (cost=0.00..1619.23 rows=58823 width=55)
         ->  Parallel Seq Scan on t2  (cost=0.00..1619.24 rows=58824 width=55)
         ->  Parallel Seq Scan on t3  (cost=0.00..1619.24 rows=58824 width=55)
         ->  Parallel Seq Scan on t4  (cost=0.00..1619.24 rows=58824 width=55)
(7 rows)

postgres=# alter table t1 set (parallel_workers=0);
ALTER TABLE
postgres=# explain select * from t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Gather  (cost=0.00..6888.70 rows=399999 width=55)
   Workers Planned: 3
   ->  Parallel Append  (cost=0.00..6888.70 rows=129032 width=55)
         ->  Seq Scan on t1  (cost=0.00..2030.99 rows=99999 width=55)
         ->  Parallel Seq Scan on t2  (cost=0.00..1619.24 rows=58824 width=55)
         ->  Parallel Seq Scan on t3  (cost=0.00..1619.24 rows=58824 width=55)
         ->  Parallel Seq Scan on t4  (cost=0.00..1619.24 rows=58824 width=55)
(7 rows)

postgres=# alter table t4 set (parallel_workers=0);
ALTER TABLE
postgres=# alter table t3 set (parallel_workers=0);
ALTER TABLE
postgres=# alter table t2 set (parallel_workers=0);
ALTER TABLE
postgres=# explain select * from t;
                              QUERY PLAN
----------------------------------------------------------------------
 Gather  (cost=0.00..4061.99 rows=399999 width=55)
   Workers Planned: 3
   ->  Parallel Append  (cost=0.00..4061.99 rows=129032 width=55)
         ->  Seq Scan on t2  (cost=0.00..2031.00 rows=100000 width=55)
         ->  Seq Scan on t3  (cost=0.00..2031.00 rows=100000 width=55)
         ->  Seq Scan on t4  (cost=0.00..2031.00 rows=100000 width=55)
         ->  Seq Scan on t1  (cost=0.00..2030.99 rows=99999 width=55)
(7 rows)

Note that after the change the row estimate (129032) for Parallel
Append is consistent across all three cases.

Attached, please find the patch which fixes this issue (thanks to
Dilip for helping me identify the above case and the fix).  I have
also modified the comment atop choose_next_subplan_for_worker() as
discussed above.  The change to remove the unnecessary inclusion of
spin.h is still in a separate patch, as posted above.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Observations in Parallel Append

From: Robert Haas
On Tue, Jan 2, 2018 at 11:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Attached, please find the patch which fixes this issue (thanks to
> Dilip for helping me identify the above case and the fix).  I have
> also modified the comment atop choose_next_subplan_for_worker() as
> discussed above.  The change to remove the unnecessary inclusion of
> spin.h is still in a separate patch, as posted above.

Here's a combined patch with some cosmetic changes which I will commit
if it looks OK to you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Observations in Parallel Append

From: Amit Kapila
On Thu, Jan 4, 2018 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 2, 2018 at 11:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Attached, please find the patch which fixes this issue (thanks to
>> Dilip for helping me identify the above case and the fix).  I have
>> also modified the comment atop choose_next_subplan_for_worker() as
>> discussed above.  The change to remove the unnecessary inclusion of
>> spin.h is still in a separate patch, as posted above.
>
> Here's a combined patch with some cosmetic changes which I will commit
> if it looks OK to you.
>

Looks good to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Observations in Parallel Append

From: Robert Haas
On Wed, Jan 3, 2018 at 10:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Here's a combined patch with some cosmetic changes which I will commit
>> if it looks OK to you.
>
> Looks good to me.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company