Thread: [HACKERS] parallelize queries containing initplans

[HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
By seeing the subject line, one might wonder why we need to consider
parallelizing the queries containing initplans differently from
queries containing subplans considering that I have posted a mail to
achieve later a few hours back.  The reason is that both are treated
differently with respect to parallelism and otherwise as well and both
can be parallelized in a different way depending on the design we
choose.  InitPlans can be used in three forms (a) a Param node
representing a single scalar result (b) a row comparison tree
containing multiple Param nodes (c) NULL constant for MULTIEXPR
subquery whereas SubPlans are used as SubPlan nodes. Here, I am
primarily interested in parallelizing queries that contain InitPlans
of the form (a) and the reason is that I have seen that form used more
as compared to other forms (primarily based on a study of TPC-H and
TPC-DS workloads).  However, if we find that parallelizing other forms
can be done along with it easily, then that is excellent.  To start
with let us see the plan of TPC-H query (Q-22) and understand how it
can be improved.

Limit  InitPlan 1 (returns $0)    ->  Finalize Aggregate          ->  Gather                Workers Planned: 2
     ->  Partial Aggregate                      ->  Parallel Seq Scan on customer customer_1
Filter:(...)  ->  GroupAggregate        Group Key: ("substring"((customer.c_phone)::text, 1, 2))        ->  Sort
     Sort Key: ("substring"((customer.c_phone)::text, 1, 2))              ->  Nested Loop Anti Join
-> Seq Scan on customer                          Filter: ((c_acctbal > $0) AND (...)))                    ->  Index
OnlyScan using idx_orders_custkey on orders                          Index Cond: (o_custkey = customer.c_custkey)
 


In the above plan, we can see that the join on customer and orders
table (Nested Loop Anti Join) is not parallelised even though we have
the capability to parallelize Nested Loop Joins. The reason for not
choosing the parallel plan is that one of the nodes (Seq Scan on
customer) is referring to initplan and we consider such nodes as
parallel-restricted which means they can't be parallelised.  Now, I
could see three ways of parallelizing such a query.  The first way is
that we just push parallel-safe initplans to workers and allow them to
execute it, the drawback of this approach is that it won't be able to
push initplans in cases as shown above where initplan is
parallel-unsafe (contains Gather node) and second is we will lose the
expectation of single evaluation.  The second way is that we always
execute the initplan in the master backend and pass the resultant
value to the worker, this will allow above form of plans to push
initplans to workers and hence can help in enabling parallelism for
other nodes in plan tree.   The drawback of the second approach is
that we need to evaluate the initplan before it is actually required
which means that we might evaluate it even when it is not required.  I
am not sure if it is always safe to assume that we can evaluate the
initplan before pushing it to workers especially for the cases when it
is far enough down in the plan tree which we are parallelizing,
however, I think we can assume it when the iniplan is above the plan
tree where it is used (like in the above case).  The third way is that
we allow Gather node to be executed below another Gather node, but I
think that will be bad especially for the plans like above because
each worker needs to further spawn another set of workers to evaluate
the iniplan which could be done once.  Now we can build some way such
that only one of the workers executes such an initplan and share the
values with other workers, but I think overall this requires much more
effort than first or second approach.

Among all the three approaches, first seems to be simpler than the
other two, but I feel if we just do that then we leave a lot on the
table.   Another way to accomplish this project could be that we do a
mix of first and second such that when the initplan is above the plan
tree to be parallelized, then use the second approach (one-time
evaluation by master backend and share the result with workers),
otherwise use the first approach of pushing down the initplan to
workers.

Thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Wed, Dec 28, 2016 at 5:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>  To start
> with let us see the plan of TPC-H query (Q-22) and understand how it
> can be improved.
>
> Limit
>    InitPlan 1 (returns $0)
>      ->  Finalize Aggregate
>            ->  Gather
>                  Workers Planned: 2
>                  ->  Partial Aggregate
>                        ->  Parallel Seq Scan on customer customer_1
>                              Filter: (...)
>    ->  GroupAggregate
>          Group Key: ("substring"((customer.c_phone)::text, 1, 2))
>          ->  Sort
>                Sort Key: ("substring"((customer.c_phone)::text, 1, 2))
>                ->  Nested Loop Anti Join
>                      ->  Seq Scan on customer
>                            Filter: ((c_acctbal > $0) AND (...)))
>                      ->  Index Only Scan using idx_orders_custkey on orders
>                            Index Cond: (o_custkey = customer.c_custkey)
>
>
> In the above plan, we can see that the join on customer and orders
> table (Nested Loop Anti Join) is not parallelised even though we have
> the capability to parallelize Nested Loop Joins. The reason for not
> choosing the parallel plan is that one of the nodes (Seq Scan on
> customer) is referring to initplan and we consider such nodes as
> parallel-restricted which means they can't be parallelised.  Now, I
> could see three ways of parallelizing such a query.  The first way is
> that we just push parallel-safe initplans to workers and allow them to
> execute it, the drawback of this approach is that it won't be able to
> push initplans in cases as shown above where initplan is
> parallel-unsafe (contains Gather node) and second is we will lose the
> expectation of single evaluation.  The second way is that we always
> execute the initplan in the master backend and pass the resultant
> value to the worker, this will allow above form of plans to push
> initplans to workers and hence can help in enabling parallelism for
> other nodes in plan tree.
>

I have used the second way to parallelize queries containing initplans
as that can help in cases where initplans in itself also uses
parallelism and it will also retain an existing expectation of single
evaluation for initplans. The basic idea as mentioned in above mail is
to evaluate the initplans at Gather node and pass the value to worker
backends which can use it as required. The patch has used
*plan->allParam* bitmapset to evaluate the initplans at Gather node
(we traverse the planstate tree to find params at each node and we
take care to avoid multiple evaluations of same initplan).  To
identify initplan params among other params in *allParams*, the patch
has added an additional bool variable (isinitplan) in ParamExecData.
We can do it in some other way as well if there is any better
suggestion.

The patch has also changed the explain output of queries where
initplan param is evaluated at Gather node. For ex.

postgres=# explain (costs off) select t1.i from t1, t2 where t1.j=t2.j
and t1.k < (select max(k) from t3) and t1.k < (select max(k) from t3);
                       QUERY PLAN
--------------------------------------------------------
 Hash Join
   Hash Cond: (t2.j = t1.j)
   InitPlan 1 (returns $0)
     ->  Finalize Aggregate
           ->  Gather
                 Workers Planned: 1
                 ->  Partial Aggregate
                       ->  Parallel Seq Scan on t3
   InitPlan 2 (returns $1)
     ->  Finalize Aggregate
           ->  Gather
                 Workers Planned: 1
                 ->  Partial Aggregate
                       ->  Parallel Seq Scan on t3 t3_1
   ->  Gather
         Workers Planned: 1
         ->  Parallel Seq Scan on t2
   ->  Hash
         ->  Gather
               Workers Planned: 1
               Params Evaluated: $0, $1
               ->  Parallel Seq Scan on t1
                     Filter: ((k < $0) AND (k < $1))
(23 rows)


In the above plan, you can notice a line (Params Evaluated: $0, $1)
which indicates the params evaluated at Gather node.  As of now,
explain just uses the *allParam* params present at the Gather node,
but we need to traverse the planstate tree as we do during execution.
This patch gives 2.5~3x performance benefit for Q-22 of TPC-H.

>   The drawback of the second approach is
> that we need to evaluate the initplan before it is actually required
> which means that we might evaluate it even when it is not required.  I
> am not sure if it is always safe to assume that we can evaluate the
> initplan before pushing it to workers especially for the cases when it
> is far enough down in the plan tree which we are parallelizing,
>

I think we can always pull up un-correlated initplans at Gather node,
however, if there is a correlated initplan, then it is better not to
allow such initplans for being pushed below gather.  Ex. of correlated
initplans:

postgres=# explain (costs off) select * from t1 where t1.i in (select
t2.i from t2 where t1.k = (select max(k) from t3 where t3.i=t1.i));
                  QUERY PLAN
----------------------------------------------
 Seq Scan on t1
   Filter: (SubPlan 2)
   SubPlan 2
     ->  Gather
           Workers Planned: 1
           Params Evaluated: $1
           InitPlan 1 (returns $1)
             ->  Aggregate
                   ->  Seq Scan on t3
                         Filter: (i = t1.i)
           ->  Result
                 One-Time Filter: (t1.k = $1)
                 ->  Parallel Seq Scan on t2
(13 rows)

It might be safe to allow above plan, but in general, such plans
should not be allowed, because it might not be feasible to compute
such initplan references at Gather node.  I am still thinking on the
best way to deal with such initplans.

Thoughts?

Thanks to Kuntal who is a co-author of this patch for doing the
investigation along with me of different plans which contain
references to initplans.


Note - This patch needs to be applied on top of subplan patches [1][2].

[1] - https://www.postgresql.org/message-id/CAA4eK1KYQjQzQMpEz%2BQRA2fmim386gQLQBEf%2Bp2Wmtqjh1rjwg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1LK3NjNY4ghHUOwYfBFa%2BAb2SvccTKAxTHbOdW1NhUjvQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Tue, Jan 31, 2017 at 4:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Dec 28, 2016 at 5:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>>   The drawback of the second approach is
>> that we need to evaluate the initplan before it is actually required
>> which means that we might evaluate it even when it is not required.  I
>> am not sure if it is always safe to assume that we can evaluate the
>> initplan before pushing it to workers especially for the cases when it
>> is far enough down in the plan tree which we are parallelizing,
>>
>
> I think we can always pull up un-correlated initplans at Gather node,
> however, if there is a correlated initplan, then it is better not to
> allow such initplans for being pushed below gather.  Ex. of correlated
> initplans:
>
> postgres=# explain (costs off) select * from t1 where t1.i in (select
> t2.i from t2 where t1.k = (select max(k) from t3 where t3.i=t1.i));
>                   QUERY PLAN
> ----------------------------------------------
>  Seq Scan on t1
>    Filter: (SubPlan 2)
>    SubPlan 2
>      ->  Gather
>            Workers Planned: 1
>            Params Evaluated: $1
>            InitPlan 1 (returns $1)
>              ->  Aggregate
>                    ->  Seq Scan on t3
>                          Filter: (i = t1.i)
>            ->  Result
>                  One-Time Filter: (t1.k = $1)
>                  ->  Parallel Seq Scan on t2
> (13 rows)
>
> It might be safe to allow above plan, but in general, such plans
> should not be allowed, because it might not be feasible to compute
> such initplan references at Gather node.  I am still thinking on the
> best way to deal with such initplans.
>

I could see two possibilities to determine whether the plan (for which
we are going to generate an initplan) contains a reference to a
correlated var param node.  One is to write a plan or path walker to
determine any such reference and the second is to keep the information
about the correlated param in path node.   I think the drawback of the
first approach is that traversing path tree during generation of
initplan can be costly, so for now I have kept the information in path
node to prohibit generating parallel initplans which contain a
reference to correlated vars. I think we can go with first approach of
using path walker if people feel that is better than maintaining a
reference in path.  Attached patch
prohibit_parallel_correl_params_v1.patch implements the second
approach of keeping the correlated var param reference in path node
and pq_pushdown_initplan_v2.patch uses that to generate parallel
initplans.

Thoughts?

These patches build on top of parallel subplan patch [1].

[1] - https://www.postgresql.org/message-id/CAA4eK1KYQjQzQMpEz+QRA2fmim386gQLQBEf+p2Wmtqjh1rjwg@mail.gmail.com


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Fri, Feb 10, 2017 at 4:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I could see two possibilities to determine whether the plan (for which
> we are going to generate an initplan) contains a reference to a
> correlated var param node.  One is to write a plan or path walker to
> determine any such reference and the second is to keep the information
> about the correlated param in path node.   I think the drawback of the
> first approach is that traversing path tree during generation of
> initplan can be costly, so for now I have kept the information in path
> node to prohibit generating parallel initplans which contain a
> reference to correlated vars. I think we can go with first approach of
> using path walker if people feel that is better than maintaining a
> reference in path.  Attached patch
> prohibit_parallel_correl_params_v1.patch implements the second
> approach of keeping the correlated var param reference in path node
> and pq_pushdown_initplan_v2.patch uses that to generate parallel
> initplans.
>

Two weeks back when Robert was in Bangalore, we (myself, Kuntal and
Robert) had a discussion on this patch.   He mentioned that the idea
of pulling up initplans (uncorrelated initplans) at Gather node (and
then execute them and share the values to each worker) used in this
patch doesn't sound appealing and has a chance of bugs in some corner
cases. We discussed an idea where the first worker to access the
initplan will evaluate it and then share the value with other
participating processes, but with that, we won't be able to use
parallelism in the execution of Initplan due to the restriction of
multiple levels of Gather node.  Another idea we discussed is that we
can evaluate the Initplans at Gather node if it is used as an external
param (plan->extParam) at or below the Gather node.

Based on that idea, I have modified the patch such that it will
compute the set of initplans Params that are required below gather
node and store them as bitmap of initplan params at gather node.
During set_plan_references, we can find the intersection of external
parameters that are required at Gather or nodes below it with the
initplans that are passed from same or above query level. Once the set
of initplan params are established, we evaluate those (if they are not
already evaluated) before execution of gather node and then pass the
computed value to each of the workers.   To identify whether a
particular param is parallel safe or not, we check if the paramid of
the param exists in initplans at same or above query level.  We don't
allow to generate gather path if there are initplans at some query
level below the current query level as those plans could be
parallel-unsafe or undirect correlated plans.

This restricts some of the cases for parallelism like when initplans
are below gather node, but the patch looks better. We can open up
those cases if required in a separate patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Kuntal Ghosh
Date:
On Tue, Mar 14, 2017 at 3:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Based on that idea, I have modified the patch such that it will
> compute the set of initplans Params that are required below gather
> node and store them as bitmap of initplan params at gather node.
> During set_plan_references, we can find the intersection of external
> parameters that are required at Gather or nodes below it with the
> initplans that are passed from same or above query level. Once the set
> of initplan params are established, we evaluate those (if they are not
> already evaluated) before execution of gather node and then pass the
> computed value to each of the workers.   To identify whether a
> particular param is parallel safe or not, we check if the paramid of
> the param exists in initplans at same or above query level.  We don't
> allow to generate gather path if there are initplans at some query
> level below the current query level as those plans could be
> parallel-unsafe or undirect correlated plans.

I would like to mention different test scenarios with InitPlans that
we've considered while developing and testing of the patch.

An InitPlan is a subselect that doesn't take any reference from its
immediate outer query level and it returns a param value. For example,
consider the following query:

          QUERY PLAN
------------------------------
 Seq Scan on t1
   Filter: (k = $0)
   allParams: $0
   InitPlan 1 (returns $0)
     ->  Aggregate
           ->  Seq Scan on t3
In this case, the InitPlan is evaluated once when the filter is
checked for the first time. For subsequent checks, we need not
evaluate the initplan again since we already have the value. In our
approach, we parallelize the sequential scan by inserting a Gather
node on top of parallel sequential scan node. At the Gather node, we
evaluate the InitPlan before spawning the workers and pass this value
to the worker using dynamic shared memory. This yields the following
plan:
                    QUERY PLAN
---------------------------------------------------
 Gather
   Workers Planned: 2
   Params Evaluated: $0
   InitPlan 1 (returns $0)
       ->  Aggregate
           ->  Seq Scan on t3
   ->  Parallel Seq Scan on t1
         Filter: (k = $0)
As Amit mentioned up in the thread, at a Gather node, we evaluate only
those InitPlans that are attached to this query level or any higher
one and are used under the Gather node. extParam at a node includes
the InitPlan params that should be passed from an outer node. I've
attached a patch to show extParams and allParams for each node. Here
is the output with that patch:
                    QUERY PLAN
---------------------------------------------------
 Gather
   Workers Planned: 2
   Params Evaluated: $0
   allParams: $0
   InitPlan 1 (returns $0)
     ->  Finalize Aggregate
           ->  Gather
                 Workers Planned: 2
                 ->  Partial Aggregate
                       ->  Parallel Seq Scan on t3
   ->  Parallel Seq Scan on t1
         Filter: (k = $0)
         allParams: $0
         extParams: $0
In this case, $0 is included in extParam of parallel sequential scan
and the InitPlan corresponding to this param is attached to the same
query level that contains the Gather node. Hence, we evaluate $0 at
Gather and pass it to workers.

But, for generating a plan like this requires marking an InitPlan
param as parallel_safe. We can't mark all params as parallel_safe
because of correlated subselects. Hence, in
max_parallel_hazard_walker, the only params marked safe are InitPlan
params from current or outer query level. An InitPlan param from inner
query level isn't marked safe since we can't evaluate this param at
any Gather node above the current node(where the param is used). As
mentioned by Amit, we also don't allow generation of gather path if
there are InitPlans at some query level below the current query level
as those plans could be parallel-unsafe or undirect correlated plans.

I've attached a script file and its output containing several
scenarios relevant to InitPlans. I've also attached the patch for
displaying extParam and allParam at each node. This patch can be
applied on top of pq_pushdown_initplan_v3.patch. Please find the
attachments.


> This restricts some of the cases for parallelism like when initplans
> are below gather node, but the patch looks better. We can open up
> those cases if required in a separate patch.
+1. Unfortunately, this patch doesn't enable parallelism for all
possible cases with InitPlans. Our objective is to keep things simple
and clean. Still, TPC-H q22 runs 2.5~3 times faster with this patch.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: parallelize queries containing initplans

From
Amit Kapila
Date:
On Thu, Mar 16, 2017 at 2:34 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> On Tue, Mar 14, 2017 at 3:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Based on that idea, I have modified the patch such that it will
>> compute the set of initplans Params that are required below gather
>> node and store them as bitmap of initplan params at gather node.
>> During set_plan_references, we can find the intersection of external
>> parameters that are required at Gather or nodes below it with the
>> initplans that are passed from same or above query level. Once the set
>> of initplan params are established, we evaluate those (if they are not
>> already evaluated) before execution of gather node and then pass the
>> computed value to each of the workers.   To identify whether a
>> particular param is parallel safe or not, we check if the paramid of
>> the param exists in initplans at same or above query level.  We don't
>> allow to generate gather path if there are initplans at some query
>> level below the current query level as those plans could be
>> parallel-unsafe or undirect correlated plans.
>
> I would like to mention different test scenarios with InitPlans that
> we've considered while developing and testing of the patch.
>

Thanks a lot Kuntal for sharing different test scenarios.
Unfortunately, this patch doesn't received any review till now, so
there is no chance of making it in to PostgreSQL-10.  I have moved
this to next CF.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Tue, Mar 28, 2017 at 7:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 16, 2017 at 2:34 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> On Tue, Mar 14, 2017 at 3:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Based on that idea, I have modified the patch such that it will
>>> compute the set of initplans Params that are required below gather
>>> node and store them as bitmap of initplan params at gather node.
>>> During set_plan_references, we can find the intersection of external
>>> parameters that are required at Gather or nodes below it with the
>>> initplans that are passed from same or above query level. Once the set
>>> of initplan params are established, we evaluate those (if they are not
>>> already evaluated) before execution of gather node and then pass the
>>> computed value to each of the workers.   To identify whether a
>>> particular param is parallel safe or not, we check if the paramid of
>>> the param exists in initplans at same or above query level.  We don't
>>> allow to generate gather path if there are initplans at some query
>>> level below the current query level as those plans could be
>>> parallel-unsafe or undirect correlated plans.
>>
>> I would like to mention different test scenarios with InitPlans that
>> we've considered while developing and testing of the patch.
>>
>
> Thanks a lot Kuntal for sharing different test scenarios.
> Unfortunately, this patch doesn't received any review till now, so
> there is no chance of making it in to PostgreSQL-10.  I have moved
> this to next CF.
>

Attached is a rebased version of the patch with below changes:
a. SubplanState now directly stores Subplan rather than ExprState, so
patch needs some adjustment in that regard.
b. While rejecting the paths (based on if there are initplans at level
below the current query level) for parallelism, the rejected paths
were not marked as parallel unsafe.  Due to this in
force_parallel_mode=regress, we were able to add gather node above
parallel unsafe paths.  The modified patch ensures to mark such paths
as parallel unsafe.
c. Added regression test.
d. Improve comments in the code.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Haribabu Kommi
Date:


On Mon, Jul 17, 2017 at 10:53 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Mar 28, 2017 at 7:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 16, 2017 at 2:34 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> On Tue, Mar 14, 2017 at 3:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Based on that idea, I have modified the patch such that it will
>>> compute the set of initplans Params that are required below gather
>>> node and store them as bitmap of initplan params at gather node.
>>> During set_plan_references, we can find the intersection of external
>>> parameters that are required at Gather or nodes below it with the
>>> initplans that are passed from same or above query level. Once the set
>>> of initplan params are established, we evaluate those (if they are not
>>> already evaluated) before execution of gather node and then pass the
>>> computed value to each of the workers.   To identify whether a
>>> particular param is parallel safe or not, we check if the paramid of
>>> the param exists in initplans at same or above query level.  We don't
>>> allow to generate gather path if there are initplans at some query
>>> level below the current query level as those plans could be
>>> parallel-unsafe or undirect correlated plans.
>>
>> I would like to mention different test scenarios with InitPlans that
>> we've considered while developing and testing of the patch.
>>
>
> Thanks a lot Kuntal for sharing different test scenarios.
> Unfortunately, this patch doesn't received any review till now, so
> there is no chance of making it in to PostgreSQL-10.  I have moved
> this to next CF.
>

Attached is a rebased version of the patch with below changes:
a. SubplanState now directly stores Subplan rather than ExprState, so
patch needs some adjustment in that regard.
b. While rejecting the paths (based on if there are initplans at level
below the current query level) for parallelism, the rejected paths
were not marked as parallel unsafe.  Due to this in
force_parallel_mode=regress, we were able to add gather node above
parallel unsafe paths.  The modified patch ensures to mark such paths
as parallel unsafe.
c. Added regression test.
d. Improve comments in the code.


I tested the latest patch and the parallel plan is getting choose for most of
the init plans.

For the following query the parallel plan is not chosen. The query contains
an init plan that refer the outer node. 

postgres=# explain analyze select * from t1 where t1.i in (select t2.i from t2 where t1.k = (select max(k) from t3 where t3.i=t1.i));
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..22426.28 rows=448 width=12) (actual time=8.335..132.557 rows=2 loops=1)
   Filter: (SubPlan 2)
   Rows Removed by Filter: 894
   SubPlan 2
     ->  Result  (cost=16.27..31.26 rows=999 width=4) (actual time=0.146..0.146 rows=0 loops=896)
           One-Time Filter: (t1.k = $1)
           InitPlan 1 (returns $1)
             ->  Aggregate  (cost=16.25..16.27 rows=1 width=4) (actual time=0.145..0.145 rows=1 loops=896)
                   ->  Seq Scan on t3  (cost=0.00..16.25 rows=2 width=4) (actual time=0.131..0.144 rows=0 loops=896)
                         Filter: (i = t1.i)
                         Rows Removed by Filter: 900
           ->  Seq Scan on t2  (cost=16.27..31.26 rows=999 width=4) (actual time=0.012..0.013 rows=10 loops=2)
 Planning time: 0.272 ms
 Execution time: 132.623 ms
(14 rows)

If I change the query a little bit, the Result node doesn't appear and the parallel plan
gets chosen.

postgres=# explain analyze select * from t1 where t1.i in (select t2.i from t2 where t2.k = (select max(k) from t3 where t3.i=t1.i));
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..19162.88 rows=448 width=12) (actual time=3501.483..3501.483 rows=0 loops=1)
   Filter: (SubPlan 2)
   Rows Removed by Filter: 896
   SubPlan 2
     ->  Gather  (cost=16.27..26.47 rows=2 width=4) (actual time=3.471..3.795 rows=0 loops=896)
           Workers Planned: 2
           Params Evaluated: $1
           Workers Launched: 2
           InitPlan 1 (returns $1)
             ->  Aggregate  (cost=16.25..16.27 rows=1 width=4) (actual time=0.161..0.161 rows=1 loops=896)
                   ->  Seq Scan on t3  (cost=0.00..16.25 rows=2 width=4) (actual time=0.144..0.156 rows=0 loops=896)
                         Filter: (i = t1.i)
                         Rows Removed by Filter: 900
           ->  Parallel Seq Scan on t2  (cost=0.00..10.20 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=804608)
                 Filter: (k = $1)
                 Rows Removed by Filter: 1
 Planning time: 0.480 ms
 Execution time: 3502.016 ms
(18 rows)

I didn't check the code why the plan is not getting chosen.
Just shared it for your reference, whether it is a known already.


Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] parallelize queries containing initplans

From
Kuntal Ghosh
Date:
On Wed, Aug 9, 2017 at 10:24 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
> I tested the latest patch and the parallel plan is getting choose for most
> of
> the init plans.
>
Thanks for testing.

> For the following query the parallel plan is not chosen. The query contains
> an init plan that refer the outer node.
>
> postgres=# explain analyze select * from t1 where t1.i in (select t2.i from
> t2 where t1.k = (select max(k) from t3 where t3.i=t1.i));
>                                                      QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..22426.28 rows=448 width=12) (actual
> time=8.335..132.557 rows=2 loops=1)
>    Filter: (SubPlan 2)
>    Rows Removed by Filter: 894
>    SubPlan 2
>      ->  Result  (cost=16.27..31.26 rows=999 width=4) (actual
> time=0.146..0.146 rows=0 loops=896)
>            One-Time Filter: (t1.k = $1)
>            InitPlan 1 (returns $1)
>              ->  Aggregate  (cost=16.25..16.27 rows=1 width=4) (actual
> time=0.145..0.145 rows=1 loops=896)
>                    ->  Seq Scan on t3  (cost=0.00..16.25 rows=2 width=4)
> (actual time=0.131..0.144 rows=0 loops=896)
>                          Filter: (i = t1.i)
>                          Rows Removed by Filter: 900
>            ->  Seq Scan on t2  (cost=16.27..31.26 rows=999 width=4) (actual
> time=0.012..0.013 rows=10 loops=2)
>  Planning time: 0.272 ms
>  Execution time: 132.623 ms
> (14 rows)
>
An observation is that the filter at Result node can't be pushed down
to the sequential scan on t2 because the filter is on t1. So, it has
to scan the complete t2 relation and send all the tuple to upper node,
a worst case for parallelism. Probably, this is the reason the
optimizer doesn't pick parallel plan for the above case.

Just for clarification, do you see any changes in the plan after
forcing parallelism(parallel_tuple_cost, parallel_setup_cost,
min_parallel_table_scan_size=0)?


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Wed, Aug 9, 2017 at 10:24 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
>
> On Mon, Jul 17, 2017 at 10:53 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> On Tue, Mar 28, 2017 at 7:25 AM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > On Thu, Mar 16, 2017 at 2:34 AM, Kuntal Ghosh
>> > <kuntalghosh.2007@gmail.com> wrote:
>> >> On Tue, Mar 14, 2017 at 3:20 PM, Amit Kapila <amit.kapila16@gmail.com>
>> >> wrote:
>> >>> Based on that idea, I have modified the patch such that it will
>> >>> compute the set of initplans Params that are required below gather
>> >>> node and store them as bitmap of initplan params at gather node.
>> >>> During set_plan_references, we can find the intersection of external
>> >>> parameters that are required at Gather or nodes below it with the
>> >>> initplans that are passed from same or above query level. Once the set
>> >>> of initplan params are established, we evaluate those (if they are not
>> >>> already evaluated) before execution of gather node and then pass the
>> >>> computed value to each of the workers.   To identify whether a
>> >>> particular param is parallel safe or not, we check if the paramid of
>> >>> the param exists in initplans at same or above query level.  We don't
>> >>> allow to generate gather path if there are initplans at some query
>> >>> level below the current query level as those plans could be
>> >>> parallel-unsafe or undirect correlated plans.
>> >>
>> >> I would like to mention different test scenarios with InitPlans that
>> >> we've considered while developing and testing of the patch.
>> >>
>> >
>> > Thanks a lot Kuntal for sharing different test scenarios.
>> > Unfortunately, this patch doesn't received any review till now, so
>> > there is no chance of making it in to PostgreSQL-10.  I have moved
>> > this to next CF.
>> >
>>
>> Attached is a rebased version of the patch with below changes:
>> a. SubplanState now directly stores Subplan rather than ExprState, so
>> patch needs some adjustment in that regard.
>> b. While rejecting the paths (based on if there are initplans at level
>> below the current query level) for parallelism, the rejected paths
>> were not marked as parallel unsafe.  Due to this in
>> force_parallel_mode=regress, we were able to add gather node above
>> parallel unsafe paths.  The modified patch ensures to mark such paths
>> as parallel unsafe.
>> c. Added regression test.
>> d. Improve comments in the code.
>>
>
> I tested the latest patch and the parallel plan is getting choose for most
> of
> the init plans.
>

Thanks for looking into this patch.

> For the following query the parallel plan is not chosen. The query contains
> an init plan that refer the outer node.
>

We don't want to generate the parallel plan for such cases.  Whenever
initplan refers to any outer node (aka correlated plan), it won't
generate a parallel plan.  Also, for t2, it doesn't choose a parallel
plan because one-time filter refers to the outer node (again
correlated plan case). Basically, till now we don't support parallel
plan for any case where the correlated plan is used.  So, it is
perfectly valid that it doesn't use parallel plan here.

> postgres=# explain analyze select * from t1 where t1.i in (select t2.i from
> t2 where t1.k = (select max(k) from t3 where t3.i=t1.i));
>                                                      QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..22426.28 rows=448 width=12) (actual
> time=8.335..132.557 rows=2 loops=1)
>    Filter: (SubPlan 2)
>    Rows Removed by Filter: 894
>    SubPlan 2
>      ->  Result  (cost=16.27..31.26 rows=999 width=4) (actual
> time=0.146..0.146 rows=0 loops=896)
>            One-Time Filter: (t1.k = $1)
>            InitPlan 1 (returns $1)
>              ->  Aggregate  (cost=16.25..16.27 rows=1 width=4) (actual
> time=0.145..0.145 rows=1 loops=896)
>                    ->  Seq Scan on t3  (cost=0.00..16.25 rows=2 width=4)
> (actual time=0.131..0.144 rows=0 loops=896)
>                          Filter: (i = t1.i)
>                          Rows Removed by Filter: 900
>            ->  Seq Scan on t2  (cost=16.27..31.26 rows=999 width=4) (actual
> time=0.012..0.013 rows=10 loops=2)
>  Planning time: 0.272 ms
>  Execution time: 132.623 ms
> (14 rows)
>
> If I change the query a little bit, the Result node doesn't appear and the
> parallel plan
> gets chosen.
>

This is a valid case for choosing a parallel plan.

> postgres=# explain analyze select * from t1 where t1.i in (select t2.i from
> t2 where t2.k = (select max(k) from t3 where t3.i=t1.i));
>                                                        QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..19162.88 rows=448 width=12) (actual
> time=3501.483..3501.483 rows=0 loops=1)
>    Filter: (SubPlan 2)
>    Rows Removed by Filter: 896
>    SubPlan 2
>      ->  Gather  (cost=16.27..26.47 rows=2 width=4) (actual
> time=3.471..3.795 rows=0 loops=896)
>            Workers Planned: 2
>            Params Evaluated: $1
>            Workers Launched: 2
>            InitPlan 1 (returns $1)
>              ->  Aggregate  (cost=16.25..16.27 rows=1 width=4) (actual
> time=0.161..0.161 rows=1 loops=896)
>                    ->  Seq Scan on t3  (cost=0.00..16.25 rows=2 width=4)
> (actual time=0.144..0.156 rows=0 loops=896)
>                          Filter: (i = t1.i)
>                          Rows Removed by Filter: 900
>            ->  Parallel Seq Scan on t2  (cost=0.00..10.20 rows=1 width=4)
> (actual time=0.001..0.001 rows=0 loops=804608)
>                  Filter: (k = $1)
>                  Rows Removed by Filter: 1
>  Planning time: 0.480 ms
>  Execution time: 3502.016 ms
> (18 rows)
>

Here if you notice the parallel node t2 refers to the initplan which
can be parallelised after this patch.  Basically, whenever the
initplan is attached at or above Gather node, we compute its value and
pass down to workers.

> I didn't check the code why the plan is not getting chosen.
> Just shared it for your reference, whether it is a known already.
>

Yeah, it is known the behavior of the patch.

By the way, the patch doesn't apply on HEAD, so attached rebased patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Haribabu Kommi
Date:


On Wed, Aug 9, 2017 at 8:54 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
On Wed, Aug 9, 2017 at 10:24 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
> I tested the latest patch and the parallel plan is getting choose for most
> of
> the init plans.
>
Thanks for testing.

> For the following query the parallel plan is not chosen. The query contains
> an init plan that refer the outer node.
>
> postgres=# explain analyze select * from t1 where t1.i in (select t2.i from
> t2 where t1.k = (select max(k) from t3 where t3.i=t1.i));
>                                                      QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..22426.28 rows=448 width=12) (actual
> time=8.335..132.557 rows=2 loops=1)
>    Filter: (SubPlan 2)
>    Rows Removed by Filter: 894
>    SubPlan 2
>      ->  Result  (cost=16.27..31.26 rows=999 width=4) (actual
> time=0.146..0.146 rows=0 loops=896)
>            One-Time Filter: (t1.k = $1)
>            InitPlan 1 (returns $1)
>              ->  Aggregate  (cost=16.25..16.27 rows=1 width=4) (actual
> time=0.145..0.145 rows=1 loops=896)
>                    ->  Seq Scan on t3  (cost=0.00..16.25 rows=2 width=4)
> (actual time=0.131..0.144 rows=0 loops=896)
>                          Filter: (i = t1.i)
>                          Rows Removed by Filter: 900
>            ->  Seq Scan on t2  (cost=16.27..31.26 rows=999 width=4) (actual
> time=0.012..0.013 rows=10 loops=2)
>  Planning time: 0.272 ms
>  Execution time: 132.623 ms
> (14 rows)
>
An observation is that the filter at Result node can't be pushed down
to the sequential scan on t2 because the filter is on t1. So, it has
to scan the complete t2 relation and send all the tuple to upper node,
a worst case for parallelism. Probably, this is the reason the
optimizer doesn't pick parallel plan for the above case.

Just for clarification, do you see any changes in the plan after
forcing parallelism(parallel_tuple_cost, parallel_setup_cost,
min_parallel_table_scan_size=0)?

There is no plan change with parallel* GUC changes.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] parallelize queries containing initplans

From
Haribabu Kommi
Date:


On Wed, Aug 9, 2017 at 9:26 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 9, 2017 at 10:24 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
>
> For the following query the parallel plan is not chosen. The query contains
> an init plan that refer the outer node.
>

We don't want to generate the parallel plan for such cases.  Whenever
initplan refers to any outer node (aka correlated plan), it won't
generate a parallel plan.  Also, for t2, it doesn't choose a parallel
plan because one-time filter refers to the outer node (again
correlated plan case). Basically, till now we don't support parallel
plan for any case where the correlated plan is used.  So, it is
perfectly valid that it doesn't use parallel plan here.

Thanks for providing the details.
 
Here if you notice the parallel node t2 refers to the initplan which
can be parallelised after this patch.  Basically, whenever the
initplan is attached at or above Gather node, we compute its value and
pass down to workers.

Thanks for the details. I checked the code also.

By the way, I tested the patch with by DML support for parallel patch to
check the returning of clause of insert, and all the returning clause init plans
are parallel plans with this patch.
 
By the way, the patch doesn't apply on HEAD, so attached rebased patch.


Thanks for the updated patch. I have some comments and I am yet to finish
the review.

+ /*
+ * We don't want to generate gather or gather merge node if there are
+ * initplans at some query level below the current query level as those
+ * plans could be parallel-unsafe or undirect correlated plans.  Ensure to
+ * mark all the partial and non-partial paths for a relation at this level
+ * to be parallel-unsafe.
+ */
+ if (is_initplan_below_current_query_level(root))
+ {
+ foreach(lc, rel->partial_pathlist)
+ {
+ Path   *subpath = (Path *) lfirst(lc);
+
+ subpath->parallel_safe = false;
+ }
+
+ foreach(lc, rel->pathlist)
+ {
+ Path   *subpath = (Path *) lfirst(lc);
+
+ subpath->parallel_safe = false;
+ }
+ return;
+ }
+

The above part of the code is almost same in mutiple places, is it possible
to change to function?


+ node->initParam = NULL;

This new parameter is set to NULL in make_gather function, the same parameter
is added to GatherMerge structure also, but anyway this parameter is set to NULL
makeNode macro, why only setting it to here, but not the other place. 

Do we need to set it to default value such as NULL or false if it is already the same value?
This is not related to the above parameter, for all existing parameters also.


+ if (IsA(plan, Gather))
+ ((Gather *) plan)->initParam = bms_intersect(plan->lefttree->extParam, initSetParam);
+ else if (IsA(plan, GatherMerge))
+ ((GatherMerge *) plan)->initParam = bms_intersect(plan->lefttree->extParam, initSetParam);
+ else
+ elog(ERROR, "unrecognized node type: %d", nodeTag(plan));

The else case is not possible, because it is already validated for Gather or GatherMerge.
Can we change it simple if and else?

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Wed, Aug 9, 2017 at 6:51 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> On Wed, Aug 9, 2017 at 9:26 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>
> By the way, I tested the patch with by DML support for parallel patch to
> check the returning of clause of insert, and all the returning clause init
> plans
> are parallel plans with this patch.
>

Good to know.

>>
>> By the way, the patch doesn't apply on HEAD, so attached rebased patch.
>
>
>
> Thanks for the updated patch. I have some comments and I am yet to finish
> the review.
>
> + /*
> + * We don't want to generate gather or gather merge node if there are
> + * initplans at some query level below the current query level as those
> + * plans could be parallel-unsafe or undirect correlated plans.  Ensure to
> + * mark all the partial and non-partial paths for a relation at this level
> + * to be parallel-unsafe.
> + */
> + if (is_initplan_below_current_query_level(root))
> + {
> + foreach(lc, rel->partial_pathlist)
> + {
> + Path   *subpath = (Path *) lfirst(lc);
> +
> + subpath->parallel_safe = false;
> + }
> +
> + foreach(lc, rel->pathlist)
> + {
> + Path   *subpath = (Path *) lfirst(lc);
> +
> + subpath->parallel_safe = false;
> + }
> + return;
> + }
> +
>
> The above part of the code is almost same in mutiple places, is it possible
> to change to function?
>

Sure, we can do that and I think that makes sense as well, so changed
accordingly in the attached patch.

>
> + node->initParam = NULL;
>
> This new parameter is set to NULL in make_gather function, the same
> parameter
> is added to GatherMerge structure also, but anyway this parameter is set to
> NULL
> makeNode macro, why only setting it to here, but not the other place.
>
> Do we need to set it to default value such as NULL or false if it is already
> the same value?
> This is not related to the above parameter, for all existing parameters
> also.
>

Strictly speaking, it is not required, but I have initialised at only
one of the place to make it consistent to near by code.  We already do
similar stuff in some other functions like make_seqscan,
make_samplescan, ..

I don't see any value in trying to initialize it for GatherMerge as
well, so left it as it is.

>
> + if (IsA(plan, Gather))
> + ((Gather *) plan)->initParam = bms_intersect(plan->lefttree->extParam,
> initSetParam);
> + else if (IsA(plan, GatherMerge))
> + ((GatherMerge *) plan)->initParam =
> bms_intersect(plan->lefttree->extParam, initSetParam);
> + else
> + elog(ERROR, "unrecognized node type: %d", nodeTag(plan));
>
> The else case is not possible, because it is already validated for Gather or
> GatherMerge.
> Can we change it simple if and else?
>

As we already have an assert in this function to protect from any
other node type (nodes other than Gather and Gather Merge), it makes
sense to change the code to just if...else, so changed accordingly.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Haribabu Kommi
Date:


On Fri, Aug 11, 2017 at 1:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 9, 2017 at 6:51 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
>
> + if (IsA(plan, Gather))
> + ((Gather *) plan)->initParam = bms_intersect(plan->lefttree->extParam,
> initSetParam);
> + else if (IsA(plan, GatherMerge))
> + ((GatherMerge *) plan)->initParam =
> bms_intersect(plan->lefttree->extParam, initSetParam);
> + else
> + elog(ERROR, "unrecognized node type: %d", nodeTag(plan));
>
> The else case is not possible, because it is already validated for Gather or
> GatherMerge.
> Can we change it simple if and else?
>

As we already have an assert in this function to protect from any
other node type (nodes other than Gather and Gather Merge), it makes
sense to change the code to just if...else, so changed accordingly.

Thanks for the updated patch. Patch looks fine. I just have some
minor comments.

+ * ExecEvalParamExecParams
+ *
+ * Execute the subplan stored in PARAM_EXEC initplans params, if not executed
+ * till now.
+ */
+void
+ExecEvalParamExecParams(Bitmapset *params, EState *estate)

I feel it is better to explain when this function executes the sub plans that are
not executed till now? Means like in what scenario?


+ if (params == NULL)
+ nparams = 0;
+ else
+ nparams = bms_num_members(params);

bms_num_members return 0 in case if the params is NULL.
Is it fine to keep the specific check for NULL is required for performance benefit
or just remove it? Anything is fine for me.


+ if (IsA(plan, Gather))
+ ((Gather *) plan)->initParam = bms_intersect(plan->lefttree->extParam, initSetParam);
+ else
+ ((GatherMerge *) plan)->initParam = bms_intersect(plan->lefttree->extParam, initSetParam);


I think the above code is to find out the common parameters that are prsent in the external
and out params. It may be better to explain the logic in the comments.


Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Sun, Aug 13, 2017 at 6:49 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> On Fri, Aug 11, 2017 at 1:18 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>
> Thanks for the updated patch. Patch looks fine. I just have some
> minor comments.
>
> + * ExecEvalParamExecParams
> + *
> + * Execute the subplan stored in PARAM_EXEC initplans params, if not
> executed
> + * till now.
> + */
> +void
> +ExecEvalParamExecParams(Bitmapset *params, EState *estate)
>
> I feel it is better to explain when this function executes the sub plans
> that are
> not executed till now? Means like in what scenario?
>

It just means that it will execute the same initplan (subplan) just
once in master backend even if it used in multiple places.  This is
the same kind of usage as we have in ExecEvalParamExec.  You can find
its usage by using some query like

explain analyse select sum(t1.i) from t1, t2 where t1.j=t2.j and t1.k
= (select count(k) from t3) and t1.k=t2.k;

Ensure you insert some rows in t1 and t2 that match the count from t3.
If you are using the schema and data given in Kuntal's script in email
above, then you need to insert something like
t1(900,900,900);t2(900,900,900);

It generates plan like below:

postgres=# explain analyse select sum(t1.i) from t1, t2 where
t1.j=t2.j and t1.k = (select count(k) from t3) and t1.k=t2.k;
                                                            QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=29.65..29.66 rows=1 width=8) (actual
time=22572.521..22572.521 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Finalize Aggregate  (cost=9.70..9.71 rows=1 width=8) (actual
time=4345.110..4345.111 rows=1 loops=1)
           ->  Gather  (cost=9.69..9.70 rows=2 width=8) (actual
time=4285.019..4345.098 rows=3 loops=1)
                 Workers Planned: 2
                 Workers Launched: 2
                 ->  Partial Aggregate  (cost=9.69..9.70 rows=1
width=8) (actual time=0.154..0.155 rows=1 loops=3)
                       ->  Parallel Seq Scan on t3  (cost=0.00..8.75
rows=375 width=4) (actual time=0.011..0.090 rows=300 loops=3)
   ->  Nested Loop  (cost=0.00..19.93 rows=1 width=4) (actual
time=22499.918..22572.512 rows=1 loops=1)
         Join Filter: (t1.j = t2.j)
         ->  Gather  (cost=0.00..9.67 rows=2 width=12) (actual
time=10521.356..10521.363 rows=1 loops=1)
               Workers Planned: 2
               Params Evaluated: $0
               Workers Launched: 2
               ->  Parallel Seq Scan on t1  (cost=0.00..9.67 rows=1
width=12) (actual time=0.506..0.507 rows=0 loops=3)
                     Filter: (k = $0)
                     Rows Removed by Filter: 299
         ->  Materialize  (cost=0.00..10.21 rows=2 width=8) (actual
time=11978.557..12051.142 rows=1 loops=1)
               ->  Gather  (cost=0.00..10.20 rows=2 width=8) (actual
time=11978.530..12051.113 rows=1 loops=1)
                     Workers Planned: 2
                     Params Evaluated: $0
                     Workers Launched: 2
                     ->  Parallel Seq Scan on t2  (cost=0.00..10.20
rows=1 width=8) (actual time=0.067..0.067 rows=0 loops=3)
                           Filter: (k = $0)
                           Rows Removed by Filter: 333
 Planning time: 15103.237 ms
 Execution time: 22574.703 ms
(27 rows)

You can notice that initplan is used at multiple nodes, but it will be
evaluated just once.  If you want, I can add a sentence in the
comments, but I think this is somewhat obvious and the same use case
already exists.  Let me know if you still think that comments need to
be expanded?

>
> + if (params == NULL)
> + nparams = 0;
> + else
> + nparams = bms_num_members(params);
>
> bms_num_members return 0 in case if the params is NULL.
> Is it fine to keep the specific check for NULL is required for performance
> benefit
> or just remove it? Anything is fine for me.
>

I don't see any performance benefit here, so removed the if check.

>
> + if (IsA(plan, Gather))
> + ((Gather *) plan)->initParam = bms_intersect(plan->lefttree->extParam,
> initSetParam);
> + else
> + ((GatherMerge *) plan)->initParam =
> bms_intersect(plan->lefttree->extParam, initSetParam);
>
>
> I think the above code is to find out the common parameters that are prsent
> in the external
> and out params.
>

Here, we want to save all the initplan params that can be used below
the gather node.  extParam contains the set of all external PARAM_EXEC
params that can be used below gather node and we just want initplan
params out of those.

> It may be better to explain the logic in the comments.
>

I have kept comments atop of the function set_param_references to
explain the context, but now I have added few more in the code as
suggested by you.  See if that suffices the need.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Haribabu Kommi
Date:


On Mon, Aug 14, 2017 at 8:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Aug 13, 2017 at 6:49 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> On Fri, Aug 11, 2017 at 1:18 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>
> Thanks for the updated patch. Patch looks fine. I just have some
> minor comments.
>
> + * ExecEvalParamExecParams
> + *
> + * Execute the subplan stored in PARAM_EXEC initplans params, if not
> executed
> + * till now.
> + */
> +void
> +ExecEvalParamExecParams(Bitmapset *params, EState *estate)
>
> I feel it is better to explain when this function executes the sub plans
> that are
> not executed till now? Means like in what scenario?
>

It just means that it will execute the same initplan (subplan) just
once in master backend even if it used in multiple places.  This is
the same kind of usage as we have in ExecEvalParamExec.  You can find
its usage by using some query like

explain analyse select sum(t1.i) from t1, t2 where t1.j=t2.j and t1.k
= (select count(k) from t3) and t1.k=t2.k;

Ensure you insert some rows in t1 and t2 that match the count from t3.
If you are using the schema and data given in Kuntal's script in email
above, then you need to insert something like
t1(900,900,900);t2(900,900,900);

It generates plan like below:

postgres=# explain analyse select sum(t1.i) from t1, t2 where
t1.j=t2.j and t1.k = (select count(k) from t3) and t1.k=t2.k;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=29.65..29.66 rows=1 width=8) (actual
time=22572.521..22572.521 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Finalize Aggregate  (cost=9.70..9.71 rows=1 width=8) (actual
time=4345.110..4345.111 rows=1 loops=1)
           ->  Gather  (cost=9.69..9.70 rows=2 width=8) (actual
time=4285.019..4345.098 rows=3 loops=1)
                 Workers Planned: 2
                 Workers Launched: 2
                 ->  Partial Aggregate  (cost=9.69..9.70 rows=1
width=8) (actual time=0.154..0.155 rows=1 loops=3)
                       ->  Parallel Seq Scan on t3  (cost=0.00..8.75
rows=375 width=4) (actual time=0.011..0.090 rows=300 loops=3)
   ->  Nested Loop  (cost=0.00..19.93 rows=1 width=4) (actual
time=22499.918..22572.512 rows=1 loops=1)
         Join Filter: (t1.j = t2.j)
         ->  Gather  (cost=0.00..9.67 rows=2 width=12) (actual
time=10521.356..10521.363 rows=1 loops=1)
               Workers Planned: 2
               Params Evaluated: $0
               Workers Launched: 2
               ->  Parallel Seq Scan on t1  (cost=0.00..9.67 rows=1
width=12) (actual time=0.506..0.507 rows=0 loops=3)
                     Filter: (k = $0)
                     Rows Removed by Filter: 299
         ->  Materialize  (cost=0.00..10.21 rows=2 width=8) (actual
time=11978.557..12051.142 rows=1 loops=1)
               ->  Gather  (cost=0.00..10.20 rows=2 width=8) (actual
time=11978.530..12051.113 rows=1 loops=1)
                     Workers Planned: 2
                     Params Evaluated: $0
                     Workers Launched: 2
                     ->  Parallel Seq Scan on t2  (cost=0.00..10.20
rows=1 width=8) (actual time=0.067..0.067 rows=0 loops=3)
                           Filter: (k = $0)
                           Rows Removed by Filter: 333
 Planning time: 15103.237 ms
 Execution time: 22574.703 ms
(27 rows)

You can notice that initplan is used at multiple nodes, but it will be
evaluated just once.  If you want, I can add a sentence in the
comments, but I think this is somewhat obvious and the same use case
already exists.  Let me know if you still think that comments need to
be expanded?

Thanks for providing details. Yes it is clear to me.


>
> + if (IsA(plan, Gather))
> + ((Gather *) plan)->initParam = bms_intersect(plan->lefttree->extParam,
> initSetParam);
> + else
> + ((GatherMerge *) plan)->initParam =
> bms_intersect(plan->lefttree->extParam, initSetParam);
>
>
> I think the above code is to find out the common parameters that are prsent
> in the external
> and out params.
>

Here, we want to save all the initplan params that can be used below
the gather node.  extParam contains the set of all external PARAM_EXEC
params that can be used below gather node and we just want initplan
params out of those.

> It may be better to explain the logic in the comments.
>

I have kept comments atop of the function set_param_references to
explain the context, but now I have added few more in the code as
suggested by you.  See if that suffices the need.

Thanks for adding more details. It is easy to understand.

I marked the patch as ready for committer in the commitfest.


Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Mon, Aug 21, 2017 at 1:44 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
>
> Thanks for adding more details. It is easy to understand.
>
> I marked the patch as ready for committer in the commitfest.
>

Thank you.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Mon, Aug 21, 2017 at 2:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Aug 21, 2017 at 1:44 PM, Haribabu Kommi
> <kommi.haribabu@gmail.com> wrote:
>>
>>
>> Thanks for adding more details. It is easy to understand.
>>
>> I marked the patch as ready for committer in the commitfest.
>>

Rebased the patch.  The output of test case added by the patch is also
slightly changed because of the recent commit
7df2c1f8daeb361133ac8bdeaf59ceb0484e315a.  I have verified that the
new test result is as expected.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Thu, Aug 31, 2017 at 11:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Aug 21, 2017 at 2:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Mon, Aug 21, 2017 at 1:44 PM, Haribabu Kommi
>> <kommi.haribabu@gmail.com> wrote:
>>>
>>>
>>> Thanks for adding more details. It is easy to understand.
>>>
>>> I marked the patch as ready for committer in the commitfest.
>>>
>
> Rebased the patch.  The output of test case added by the patch is also
> slightly changed because of the recent commit
> 7df2c1f8daeb361133ac8bdeaf59ceb0484e315a.  I have verified that the
> new test result is as expected.
>

The latest patch again needs to be rebased.  Find rebased patch
attached with this email.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Daniel Gustafsson
Date:
> On 15 Sep 2017, at 04:45, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 31, 2017 at 11:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Mon, Aug 21, 2017 at 2:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Mon, Aug 21, 2017 at 1:44 PM, Haribabu Kommi
>>> <kommi.haribabu@gmail.com> wrote:
>>>>
>>>>
>>>> Thanks for adding more details. It is easy to understand.
>>>>
>>>> I marked the patch as ready for committer in the commitfest.
>>
>> Rebased the patch.  The output of test case added by the patch is also
>> slightly changed because of the recent commit
>> 7df2c1f8daeb361133ac8bdeaf59ceb0484e315a.  I have verified that the
>> new test result is as expected.
>
> The latest patch again needs to be rebased.  Find rebased patch
> attached with this email.

Moved to next commitfest, but changed to Waiting for author since it no longer
applies cleanly.

cheers ./daniel

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Robert Haas
Date:
On Thu, Sep 14, 2017 at 10:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> The latest patch again needs to be rebased.  Find rebased patch
> attached with this email.

I read through this patch this morning.   Copying Tom in the hopes
that he might chime in on the following two issues in particular:

1. Is there any superior alternative to adding ptype to ParamExecData?I think the reason why we don't have this already
isbecause the plan
 
tree that populates the Param must output the right type and the plan
tree that reads the Param must expect the right type, and there's no
need for anything else.  But for serialization and deserialization
this seems to be necessary.  I wonder whether it would be better to
try to capture this at the time the Param is generated (e.g.
var->vartype in assign_param_for_var) rather than derived at execution
time by applying exprType(), but I'm not sure how that would work
exactly, or whether it's really any better.

2. Do max_parallel_hazard_walker and set_param_references() have the
right idea about which parameters are acceptable candidates for this
optimization?  The idea seems to be to collect the setParam sets for
all initplans between the current query level and the root.  That
looks correct to me but I'm not an expert on this parameter stuff.

Some other comments:

- I think parallel_hazard_walker should likely work by setting
safe_param_ids to the right set of parameter IDs rather than testing
whether the parameter is safe by checking either whether it is in
safe_param_ids or some other condition is met.

- contains_parallel_unsafe_param() sounds like a function that merely
returns true or false, but it actually has major side effects.  Those
side effects also look unsafe; mutating previously-generated paths can
corrupt the rel's pathlist, because it will no longer have the sort
order and other characteristics that add_path() creates and upon which
other code relies.

- Can't is_initplan_is_below_current_query_level() be confused when
there are multiple subqueries in the tree?  For example if the
toplevel query has subqueries a and b and a has a sub-subquery aa
which has an initplan, won't this function think that b has an
initplan below the current query level?  If not, maybe a comment is in
order explaining why not?

- Why do we even need contains_parallel_unsafe_param() and
is_initplan_is_below_current_query_level() in the first place, anyway?I think there's been some discussion of that on
thisthread, but I'm
 
not sure I completely understand it, and the comments in the patch
don't help me understand why we now need this restriction.

- The new code in ExplainPrintPlan() needs a comment.

- I have typically referred in comments to "Gather or Gather Merge"
rather than "gather or gather merge".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Mon, Oct 2, 2017 at 8:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 14, 2017 at 10:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> The latest patch again needs to be rebased.  Find rebased patch
>> attached with this email.
>
> I read through this patch this morning.   Copying Tom in the hopes
> that he might chime in on the following two issues in particular:
>

It seems you forgot to include Tom, included now.

> 1. Is there any superior alternative to adding ptype to ParamExecData?
>  I think the reason why we don't have this already is because the plan
> tree that populates the Param must output the right type and the plan
> tree that reads the Param must expect the right type, and there's no
> need for anything else.  But for serialization and deserialization
> this seems to be necessary.  I wonder whether it would be better to
> try to capture this at the time the Param is generated (e.g.
> var->vartype in assign_param_for_var) rather than derived at execution
> time by applying exprType(), but I'm not sure how that would work
> exactly, or whether it's really any better.
>
> 2. Do max_parallel_hazard_walker and set_param_references() have the
> right idea about which parameters are acceptable candidates for this
> optimization?  The idea seems to be to collect the setParam sets for
> all initplans between the current query level and the root.  That
> looks correct to me but I'm not an expert on this parameter stuff.
>
> Some other comments:
>
> - I think parallel_hazard_walker should likely work by setting
> safe_param_ids to the right set of parameter IDs rather than testing
> whether the parameter is safe by checking either whether it is in
> safe_param_ids or some other condition is met.
>

Okay, I think it should work the way you are suggesting.  I will give it a try.

> - contains_parallel_unsafe_param() sounds like a function that merely
> returns true or false, but it actually has major side effects.  Those
> side effects also look unsafe; mutating previously-generated paths can
> corrupt the rel's pathlist, because it will no longer have the sort
> order and other characteristics that add_path() creates and upon which
> other code relies.
>
> - Can't is_initplan_is_below_current_query_level() be confused when
> there are multiple subqueries in the tree?  For example if the
> toplevel query has subqueries a and b and a has a sub-subquery aa
> which has an initplan, won't this function think that b has an
> initplan below the current query level?  If not, maybe a comment is in
> order explaining why not?
>

We can discuss above two points once you are convinced about whether
we need any such functions, so let's first discuss that.

> - Why do we even need contains_parallel_unsafe_param() and
> is_initplan_is_below_current_query_level() in the first place, anyway?
>  I think there's been some discussion of that on this thread, but I'm
> not sure I completely understand it, and the comments in the patch
> don't help me understand why we now need this restriction.
>

This is to ensure that initplans are never below Gather node.  If we
allow parallelism when initplan is below current query level, (a) then
we need a way to pass the result of initplan from worker to other
workers and to master backend (b) if initplan contains reference to
some parameter above the current query level (undirect correlated
plans), then we need a mechanism to pass such parameters (basically
allow correlated initplans work).

Now, one might think (a) and or (b) is desirable so that it can be
used in more number of cases, but based on TPC-H analysis we have
found that it is quite useful even without those and in fact after we
support those cases the benefits might not be significant.

The patch contains few comments in generate_gather_paths and at the
top of functions contains_parallel_unsafe_param and
is_initplan_is_below_current_query_level, if you feel it should be
explained in further detail, then let me know.


> - The new code in ExplainPrintPlan() needs a comment.
>

There was a comment, but I have added a somewhat detailed comment now,
check if that makes it clear.

> - I have typically referred in comments to "Gather or Gather Merge"
> rather than "gather or gather merge".
>

Changed as per suggestion.

Changed funcation name is_initplan_is_below_current_query_level to
is_initplan_below_current_query_level as well.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Robert Haas
Date:
On Tue, Oct 3, 2017 at 7:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> - Why do we even need contains_parallel_unsafe_param() and
>> is_initplan_is_below_current_query_level() in the first place, anyway?
>>  I think there's been some discussion of that on this thread, but I'm
>> not sure I completely understand it, and the comments in the patch
>> don't help me understand why we now need this restriction.
>
> This is to ensure that initplans are never below Gather node.  If we
> allow parallelism when initplan is below current query level, (a) then
> we need a way to pass the result of initplan from worker to other
> workers and to master backend (b) if initplan contains reference to
> some parameter above the current query level (undirect correlated
> plans), then we need a mechanism to pass such parameters (basically
> allow correlated initplans work).

Man, this stuff makes my hurt.  Leaving aside the details of what this
particular patch tries to handle, and with due regard for the
possibility that I don't know what I'm talking about, I feel like
there are four cases: (1) the InitPlan refers to one or more
parameters whose value is computed below the Gather node, (2) the
InitPlan refers to one or more parameters whose value is computed
above the Gather node, (3) the InitPlan refers to both parameters
computed above the Gather and parameters computed below the Gather,
and (4) the InitPlan refers to no parameters at all.

In cases (1) and (3), we cannot compute the value at the Gather node
because some of the necessary information doesn't exist yet.  it
wouldn't make any sense to try to compute it at that point because
each the value on which the InitPlan output depends can be different
in each worker and can change at different times in different workers.
However, in case (1), it's definitely OK to compute the value in the
worker itself, and in case (3), it's OK provided that the necessary
parameter values from above the Gather are passed down to the workers.

In cases (2) and (4), we can compute the value at the Gather node.  If
we tried to compute it within individual workers, then we'd run into a
few problems: (a) the plan might not be parallel-safe, (b) it would be
inefficient to recompute the same value in every worker, and (c) if
the expression is volatile, the workers might not all arrive at the
same answer.  Now, if the plan is parallel-safe, then we could also
choose NOT to compute the value at the Gather node and instead let the
first participant that needs the value compute it; while it's being
computed, it could fence off other participants from embarking on the
same computation, and once the computation is done, all participants
get the same computed value.  But AFAICS all of that is strictly
optional: in any case where the InitPlan doesn't depend a value
produced below the Gather, it is legal (implementation difficulties
aside) to compute it at the Gather node.

Having said all that, I think that this patch only wants to handle the
subset of cases (2) and (4) where the relevant InitPlan is attached
ABOVE the Gather node -- which seems very reasonable, because
evaluating an InitPlan at a level of the plan tree above the level at
which it is defined sounds like it might be complex.  But I still
don't quite see why we need these tests.  I mean, if we only allow
Param references from a set of safe parameter IDs, and the safe
parameter IDs include only those IDs that can be generated in a
worker, then won't other InitPlans in the workers anyway be ruled out?I assume there won't be an InitPlan whose output
Paramis never
 
referenced anywhere -- and if there were, it wouldn't really matter
whether a worker thought it could compute it, because there'd be
nothing to trigger that computation to actually happen.

If I am all mixed up, please help straighten me out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Wed, Oct 4, 2017 at 3:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 3, 2017 at 7:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Having said all that, I think that this patch only wants to handle the
> subset of cases (2) and (4) where the relevant InitPlan is attached
> ABOVE the Gather node -- which seems very reasonable, because
> evaluating an InitPlan at a level of the plan tree above the level at
> which it is defined sounds like it might be complex.  But I still
> don't quite see why we need these tests.  I mean, if we only allow
> Param references from a set of safe parameter IDs, and the safe
> parameter IDs include only those IDs that can be generated in a
> worker, then won't other InitPlans in the workers anyway be ruled out?

It doesn't happen always.  There are cases when the part of required
conditions are pushed one query level below, so when we check during
max_parallel_hazard_walker, they look safe, but actually, they are
not.  I will try to explain by example:

postgres=# explain (costs off, verbose) select * from t1 where t1.i in
( select 1 + (select max(j) from t3));                             QUERY PLAN
----------------------------------------------------------------------Hash Semi Join  Output: t1.i, t1.j, t1.k  Hash
Cond:(t1.i = ((1 + $1)))  ->  Seq Scan on public.t1        Output: t1.i, t1.j, t1.k  ->  Hash        Output: ((1 + $1))
      ->  Result              Output: (1 + $1)              InitPlan 1 (returns $1)                ->  Finalize
Aggregate                     Output: max(t3.j)                      ->  Gather                            Output:
(PARTIALmax(t3.j))                            Workers Planned: 2                            ->  Partial Aggregate
                          Output: PARTIAL max(t3.j)                                  ->  Parallel Seq Scan on public.t3
                                      Output: t3.j
 
(19 rows)

In the above example, you can see that the condition referring to
initplan (1 + $1) is pushed one level below.  So when it tries to
check parallel safety for the join condition, it won't see Param node.
Now, consider if we don't check contains_parallel_unsafe_param during
generate_gather_paths, then it will lead to below plan.


postgres=# explain (costs off, verbose) select * from t1 where t1.i in
( select 1 + (select max(j) from t3));                                QUERY PLAN
----------------------------------------------------------------------------Gather  Output: t1.i, t1.j, t1.k  Workers
Planned:2  ->  Hash Semi Join        Output: t1.i, t1.j, t1.k        Hash Cond: (t1.i = ((1 + $1)))        ->  Parallel
SeqScan on public.t1              Output: t1.i, t1.j, t1.k        ->  Hash              Output: ((1 + $1))
-> Result                    Output: (1 + $1)                    InitPlan 1 (returns $1)                      ->
FinalizeAggregate                            Output: max(t3.j)                            ->  Gather
             Output: (PARTIAL max(t3.j))                                  Workers Planned: 2
     ->  Partial Aggregate                                        Output: PARTIAL max(t3.j)
          ->  Parallel Seq Scan on public.t3                                              Output: t3.j
 
(22 rows)

This is wrong because when we will try to evaluate params that are
required at gather node, we won't get the required param as there is
no initplan at that level.

>
> If I am all mixed up, please help straighten me out.
>

I think whatever you said is right and very clear.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Wed, Oct 4, 2017 at 12:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Oct 4, 2017 at 3:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Oct 3, 2017 at 7:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Having said all that, I think that this patch only wants to handle the
>> subset of cases (2) and (4) where the relevant InitPlan is attached
>> ABOVE the Gather node -- which seems very reasonable, because
>> evaluating an InitPlan at a level of the plan tree above the level at
>> which it is defined sounds like it might be complex.  But I still
>> don't quite see why we need these tests.  I mean, if we only allow
>> Param references from a set of safe parameter IDs, and the safe
>> parameter IDs include only those IDs that can be generated in a
>> worker, then won't other InitPlans in the workers anyway be ruled out?
>
> It doesn't happen always.  There are cases when the part of required
> conditions are pushed one query level below, so when we check during
> max_parallel_hazard_walker, they look safe, but actually, they are
> not.  I will try to explain by example:
>
> postgres=# explain (costs off, verbose) select * from t1 where t1.i in
> ( select 1 + (select max(j) from t3));

If you want to reproduce this case, then you can use the script posted
by Kuntal up thread [1].

[1] - https://www.postgresql.org/message-id/CAGz5QC%2BuHOq78GCika3fbgRyN5zgiDR9Dd1Th5kENF%2BUpnPomQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Wed, Oct 4, 2017 at 12:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Oct 4, 2017 at 3:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Oct 3, 2017 at 7:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Having said all that, I think that this patch only wants to handle the
>> subset of cases (2) and (4) where the relevant InitPlan is attached
>> ABOVE the Gather node -- which seems very reasonable, because
>> evaluating an InitPlan at a level of the plan tree above the level at
>> which it is defined sounds like it might be complex.  But I still
>> don't quite see why we need these tests.  I mean, if we only allow
>> Param references from a set of safe parameter IDs, and the safe
>> parameter IDs include only those IDs that can be generated in a
>> worker, then won't other InitPlans in the workers anyway be ruled out?
>
> It doesn't happen always.  There are cases when the part of required
> conditions are pushed one query level below, so when we check during
> max_parallel_hazard_walker, they look safe, but actually, they are
> not.  I will try to explain by example:
>
> postgres=# explain (costs off, verbose) select * from t1 where t1.i in
> ( select 1 + (select max(j) from t3));
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  Hash Semi Join
>    Output: t1.i, t1.j, t1.k
>    Hash Cond: (t1.i = ((1 + $1)))
>    ->  Seq Scan on public.t1
>          Output: t1.i, t1.j, t1.k
>    ->  Hash
>          Output: ((1 + $1))
>          ->  Result
>                Output: (1 + $1)
>                InitPlan 1 (returns $1)
>                  ->  Finalize Aggregate
>                        Output: max(t3.j)
>                        ->  Gather
>                              Output: (PARTIAL max(t3.j))
>                              Workers Planned: 2
>                              ->  Partial Aggregate
>                                    Output: PARTIAL max(t3.j)
>                                    ->  Parallel Seq Scan on public.t3
>                                          Output: t3.j
> (19 rows)
>
> In the above example, you can see that the condition referring to
> initplan (1 + $1) is pushed one level below.  So when it tries to
> check parallel safety for the join condition, it won't see Param node.
>

To add more detail, the param node in join qual is replaced with Var
during pullup of sublink.  Basically, it constructs vars from target
list of subquery and then replaces it in join quals.  Refer code:
convert_ANY_sublink_to_join()
{
..
/*
* Build a list of Vars representing the subselect outputs.
*/
subquery_vars = generate_subquery_vars(root, subselect->targetList, rtindex);

/*
* Build the new join's qual expression, replacing Params with these Vars.
*/
quals = convert_testexpr(root, sublink->testexpr, subquery_vars);
..
}

Now, unless, I am missing something here, it won't be possible to
detect params in such cases during forming of join rels and hence we
need the tests in generate_gather_paths.  Let me know if I am missing
something in this context or if you have any better ideas to make it
work?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Robert Haas
Date:
On Thu, Oct 5, 2017 at 5:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Now, unless, I am missing something here, it won't be possible to
> detect params in such cases during forming of join rels and hence we
> need the tests in generate_gather_paths.  Let me know if I am missing
> something in this context or if you have any better ideas to make it
> work?

Hmm, in a case like this, I think we shouldn't need to detect it.  The
Var is perfectly parallel-safe, the problem is that we can't use a
not-safe plan for the inner rel.  I wonder why that's happening
here... is it a problem with your patch or an existing bug?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Thu, Oct 5, 2017 at 6:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Oct 5, 2017 at 5:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Now, unless, I am missing something here, it won't be possible to
>> detect params in such cases during forming of join rels and hence we
>> need the tests in generate_gather_paths.  Let me know if I am missing
>> something in this context or if you have any better ideas to make it
>> work?
>
> Hmm, in a case like this, I think we shouldn't need to detect it.  The
> Var is perfectly parallel-safe, the problem is that we can't use a
> not-safe plan for the inner rel.  I wonder why that's happening
> here...
>

It is because the inner rel (Result Path) contains a reference to a
param which appears to be at the same query level.  Basically due to
changes in max_parallel_hazard_walker().

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Robert Haas
Date:
On Thu, Oct 5, 2017 at 1:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Oct 5, 2017 at 6:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Oct 5, 2017 at 5:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Now, unless, I am missing something here, it won't be possible to
>>> detect params in such cases during forming of join rels and hence we
>>> need the tests in generate_gather_paths.  Let me know if I am missing
>>> something in this context or if you have any better ideas to make it
>>> work?
>>
>> Hmm, in a case like this, I think we shouldn't need to detect it.  The
>> Var is perfectly parallel-safe, the problem is that we can't use a
>> not-safe plan for the inner rel.  I wonder why that's happening
>> here...
>
> It is because the inner rel (Result Path) contains a reference to a
> param which appears to be at the same query level.  Basically due to
> changes in max_parallel_hazard_walker().

I spent several hours debugging this issue this afternoon.  I think
you've misdiagnosed the problem.  I think that the Param reference in
the result path is parallel-safe; that doesn't seem to me to be wrong.
If we see a Param reference for our own query level, then either we're
below the Gather and the new logic added by this patch will pass down
the value or we're above the Gather and we can access the value
directly.  Either way, no problem.

However, I think that if you attach an InitPlan to a parallel-safe
plan, it becomes parallel-restricted.  This is obvious in the case
where the InitPlan's plan isn't itself parallel-safe, but it's also
arguably true even when the InitPlan's plan *is* parallel-safe,
because pushing that below a Gather introduces a multiple-evaluation
hazard.  I think we should fix that problem someday by providing a
DSA-based parameter store, but that's a job for another day.

And it turns out that we actually have such logic already, but this
patch removes it:

@@ -2182,7 +2181,6 @@ SS_charge_for_initplans(PlannerInfo *root,
RelOptInfo *final_rel)
               path->startup_cost += initplan_cost;               path->total_cost += initplan_cost;
-               path->parallel_safe = false;       }
       /* We needn't do set_cheapest() here, caller will do it */

Now, the header comment for SS_charge_for_initplans() is wrong; it
claims we can't transmit initPlans to workers, but I think that's
already wrong even before this patch.  On the other hand, I think that
the actual code is right even after this patch.  If I put that line
back but make contains_parallel_unsafe_param always return false, then
I can still get plans like this (I modified EXPLAIN to show Parallel
Safe markings)...

rhaas=# explain select * from pgbench_accounts where bid = (select
max(g) from generate_series(1,1000)g);                                      QUERY PLAN
-----------------------------------------------------------------------------------------Gather  (cost=12.51..648066.51
rows=100000width=97)  Parallel Safe: false  Workers Planned: 2  Params Evaluated: $0  InitPlan 1 (returns $0)    ->
Aggregate (cost=12.50..12.51 rows=1 width=4)          Parallel Safe: true          ->  Function Scan on generate_series
g (cost=0.00..10.00
 
rows=1000 width=4)                Parallel Safe: true  ->  Parallel Seq Scan on pgbench_accounts
(cost=0.00..648054.00
rows=41667 width=97)        Parallel Safe: true        Filter: (bid = $0)
(12 rows)

...but Kuntal's example no longer misbehaves:
                             QUERY PLAN
----------------------------------------------------------------------Hash Semi Join  Parallel Safe: false  Output:
t1.i,t1.j, t1.k  Hash Cond: (t1.i = ((1 + $1)))  ->  Gather        Parallel Safe: false        Output: t1.i, t1.j, t1.k
      Workers Planned: 2        ->  Parallel Seq Scan on public.t1              Parallel Safe: true
Output:t1.i, t1.j, t1.k  ->  Hash        Parallel Safe: false        Output: ((1 + $1))        ->  Result
ParallelSafe: false              Output: (1 + $1)              InitPlan 1 (returns $1)                ->  Finalize
Aggregate                     Parallel Safe: false                      Output: max(t3.j)                      ->
Gather                           Parallel Safe: false                            Output: (PARTIAL max(t3.j))
               Workers Planned: 2                            ->  Partial Aggregate
ParallelSafe: true                                  Output: PARTIAL max(t3.j)                                  ->
ParallelSeq Scan on public.t3                                        Parallel Safe: true
       Output: t3.j
 
(31 rows)

With your original path, the Result was getting marked parallel-safe,
but now it doesn't, which is correct, and after that everything seems
to just work.

Notice that in my original example the topmost plan node doubly fails
to be parallel-safe: it fails because it's a Gather, and it fails
because it has an InitPlan attached.  But that's all OK -- the
InitPlan doesn't make anything *under* the Gather unsafe, so we can
still use parallelism *at* the level where the InitPlan is attached,
just not *above* that level.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Fri, Oct 6, 2017 at 2:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Oct 5, 2017 at 1:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Oct 5, 2017 at 6:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Oct 5, 2017 at 5:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> Now, unless, I am missing something here, it won't be possible to
>>>> detect params in such cases during forming of join rels and hence we
>>>> need the tests in generate_gather_paths.  Let me know if I am missing
>>>> something in this context or if you have any better ideas to make it
>>>> work?
>>>
>>> Hmm, in a case like this, I think we shouldn't need to detect it.  The
>>> Var is perfectly parallel-safe, the problem is that we can't use a
>>> not-safe plan for the inner rel.  I wonder why that's happening
>>> here...
>>
>> It is because the inner rel (Result Path) contains a reference to a
>> param which appears to be at the same query level.  Basically due to
>> changes in max_parallel_hazard_walker().
>
> I spent several hours debugging this issue this afternoon.
>

Thanks a lot for spending time.

>  I think
> you've misdiagnosed the problem.  I think that the Param reference in
> the result path is parallel-safe; that doesn't seem to me to be wrong.
> If we see a Param reference for our own query level, then either we're
> below the Gather and the new logic added by this patch will pass down
> the value or we're above the Gather and we can access the value
> directly.  Either way, no problem.
>
> However, I think that if you attach an InitPlan to a parallel-safe
> plan, it becomes parallel-restricted.  This is obvious in the case
> where the InitPlan's plan isn't itself parallel-safe, but it's also
> arguably true even when the InitPlan's plan *is* parallel-safe,
> because pushing that below a Gather introduces a multiple-evaluation
> hazard.  I think we should fix that problem someday by providing a
> DSA-based parameter store, but that's a job for another day.
>
> And it turns out that we actually have such logic already, but this
> patch removes it:
>
> @@ -2182,7 +2181,6 @@ SS_charge_for_initplans(PlannerInfo *root,
> RelOptInfo *final_rel)
>
>                 path->startup_cost += initplan_cost;
>                 path->total_cost += initplan_cost;
> -               path->parallel_safe = false;
>         }
>
>         /* We needn't do set_cheapest() here, caller will do it */
>

Yeah, it is a mistake from my side.  I thought that we don't need it
now as we pass the computed value of initplan params for valid cases
and for other cases we can prohibit them as was done in the patch.

> Now, the header comment for SS_charge_for_initplans() is wrong; it
> claims we can't transmit initPlans to workers, but I think that's
> already wrong even before this patch.
>

What exactly you want to convey as part of that comment?

I have fixed the other review comment related to using safe_param_list
in the attached patch.  I think I have fixed all comments given by
you, but let me know if I have missed anything or you have any other
comment.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Robert Haas
Date:
On Fri, Oct 6, 2017 at 7:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have fixed the other review comment related to using safe_param_list
> in the attached patch.  I think I have fixed all comments given by
> you, but let me know if I have missed anything or you have any other
> comment.

-        Param       *param = (Param *) node;
+        if (list_member_int(context->safe_param_ids, ((Param *)
node)->paramid))
+            return false;

-        if (param->paramkind != PARAM_EXEC ||
-            !list_member_int(context->safe_param_ids, param->paramid))
-        {
-            if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
-                return true;
-        }
-        return false;            /* nothing to recurse to */
+        if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
+            return true;

I think your revised logic is wrong here because it drops the
paramkind test, and I don't see any justification for that.  IIUC,
param IDs are tracked separately for extern and exec parameters, so
why would we ignore that distinction here?

But I'm also wondering if we're missing a trick here, because isn't a
PARAM_EXTERN parameter always safe, given SerializeParamList?  If so,
this || ought to be &&, but if that adjustment is needed, it seems
like a separate patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Sat, Oct 7, 2017 at 7:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 6, 2017 at 7:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I have fixed the other review comment related to using safe_param_list
>> in the attached patch.  I think I have fixed all comments given by
>> you, but let me know if I have missed anything or you have any other
>> comment.
>
> -        Param       *param = (Param *) node;
> +        if (list_member_int(context->safe_param_ids, ((Param *)
> node)->paramid))
> +            return false;
>
> -        if (param->paramkind != PARAM_EXEC ||
> -            !list_member_int(context->safe_param_ids, param->paramid))
> -        {
> -            if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
> -                return true;
> -        }
> -        return false;            /* nothing to recurse to */
> +        if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
> +            return true;
>
> I think your revised logic is wrong here because it drops the
> paramkind test, and I don't see any justification for that.
>

I have dropped the check thinking that Param extern params will be
always safe and for param exec params we now have a list of safe
params, so we don't need this check and now again thinking about it,
it seems I might not be right.  OTOH, I am not sure if the check as
written is correct for all cases, however, I think for the purpose of
this patch, we can leave it as it is and discuss separately what
should be the check (as suggested by you).  I have reverted the check
in the attached patch.

>
> But I'm also wondering if we're missing a trick here, because isn't a
> PARAM_EXTERN parameter always safe, given SerializeParamList?
>

Right.

>  If so,
> this || ought to be &&, but if that adjustment is needed, it seems
> like a separate patch.
>

How will it work if, during planning, we encounter param_extern param
in any list?  Won't it return true in that case, because param extern
params will not be present in safe_param_ids, so this check will be
true and then max_parallel_hazard_test will also return true?

How about always returning false for PARAM_EXTERN?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
tushar
Date:
On 10/09/2017 03:26 PM, Amit Kapila wrote:
> I have reverted the check
> in the attached patch.

I have applied this patch against PG HEAD and run sqlsmith and analyzed 
results . didn't find any specific failures against this patch.

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Robert Haas
Date:
On Mon, Oct 9, 2017 at 5:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> How about always returning false for PARAM_EXTERN?

Yeah, I think that's what we should do.  Let's do that first as a
separate patch, which we might even want to back-patch, then return to
this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Wed, Oct 11, 2017 at 9:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Oct 9, 2017 at 5:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> How about always returning false for PARAM_EXTERN?
>
> Yeah, I think that's what we should do.  Let's do that first as a
> separate patch, which we might even want to back-patch, then return to
> this.
>

Sure, I have started a new thread as this fix lead to some other
failures.  I hope that is okay.

https://www.postgresql.org/message-id/CAA4eK1%2B_BuZrmVCeua5Eqnm4Co9DAXdM5HPAOE2J19ePbR912Q%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
tushar
Date:
On 10/11/2017 12:42 PM, tushar wrote:
> On 10/09/2017 03:26 PM, Amit Kapila wrote:
>> I have reverted the check
>> in the attached patch.
>
> I have applied this patch against PG HEAD and run sqlsmith and 
> analyzed results . didn't find any specific failures against this patch.
>
I did some testing of  this feature and written few testcases . PFA the 
sql  file(along with the expected .out file) .

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Wed, Oct 11, 2017 at 9:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Oct 9, 2017 at 5:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> How about always returning false for PARAM_EXTERN?
>
> Yeah, I think that's what we should do.  Let's do that first as a
> separate patch, which we might even want to back-patch, then return to
> this.
>

Now that the PARAM_EXTERN issue is fixed, I have rebased this patch.
This patch had been switched to Ready For Committer in last CF, then
Robert had comments which I have addressed, so I think the status
should be switched back to Ready For committer.  Let me know if you
think it should be switched to some other status.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Wed, Oct 18, 2017 at 2:06 PM, tushar <tushar.ahuja@enterprisedb.com> wrote:
> On 10/11/2017 12:42 PM, tushar wrote:
>>
>> On 10/09/2017 03:26 PM, Amit Kapila wrote:
>>>
>>> I have reverted the check
>>> in the attached patch.
>>
>>
>> I have applied this patch against PG HEAD and run sqlsmith and analyzed
>> results . didn't find any specific failures against this patch.
>>
> I did some testing of  this feature and written few testcases . PFA the sql
> file(along with the expected .out file) .
>

Thanks a lot Tushar for testing this patch.  In the latest patch, I
have just rebased some comments, there is no code change, so I don't
expect any change in behavior, but feel free to test it once again.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Robert Haas
Date:
On Mon, Oct 30, 2017 at 9:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Now that the PARAM_EXTERN issue is fixed, I have rebased this patch.
> This patch had been switched to Ready For Committer in last CF, then
> Robert had comments which I have addressed, so I think the status
> should be switched back to Ready For committer.  Let me know if you
> think it should be switched to some other status.

The change to ExplainPrintPlan doesn't look good to me, because it
actually moves the initPlan; I don't think it's good for EXPLAIN to
mutate the plan state tree.  It should find a way to display the
results *as if* the initPlans were attached to the subnode, but
without actually moving them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
tushar
Date:
On 10/30/2017 09:02 AM, Amit Kapila wrote:
> Thanks a lot Tushar for testing this patch.  In the latest patch, I
> have just rebased some comments, there is no code change, so I don't
> expect any change in behavior, but feel free to test it once again.

Thanks Amit. Sure.

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
tushar
Date:
On 10/30/2017 01:36 PM, tushar wrote:
> On 10/30/2017 09:02 AM, Amit Kapila wrote:
>> Thanks a lot Tushar for testing this patch.  In the latest patch, I
>> have just rebased some comments, there is no code change, so I don't
>> expect any change in behavior, but feel free to test it once again.
>
> Thanks Amit. Sure.
>
I have done re-verification ,Everything looks good.

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Mon, Oct 30, 2017 at 10:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Oct 30, 2017 at 9:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Now that the PARAM_EXTERN issue is fixed, I have rebased this patch.
>> This patch had been switched to Ready For Committer in last CF, then
>> Robert had comments which I have addressed, so I think the status
>> should be switched back to Ready For committer.  Let me know if you
>> think it should be switched to some other status.
>
> The change to ExplainPrintPlan doesn't look good to me, because it
> actually moves the initPlan; I don't think it's good for EXPLAIN to
> mutate the plan state tree.  It should find a way to display the
> results *as if* the initPlans were attached to the subnode, but
> without actually moving them.
>


Actually, with the latest patch, we don't need these changes in
ExplainPrintPlan.   Earlier, we need these changes because the patch
had changed SS_charge_for_initplans to mark the path with initplan as
parallel safe.  However, after removing that change in the previous
version of patch [1], this is not required as now we won't add gather
on top plan node having initplan.


[1] - https://www.postgresql.org/message-id/CAA4eK1JD%3DpJYBn8rN5RimiEVtPJmVNmyq5p6VoZBnUw2xRYB7w%40mail.gmail.com
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Mon, Oct 30, 2017 at 9:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Oct 11, 2017 at 9:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Oct 9, 2017 at 5:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> How about always returning false for PARAM_EXTERN?
>>
>> Yeah, I think that's what we should do.  Let's do that first as a
>> separate patch, which we might even want to back-patch, then return to
>> this.
>>
>
> Now that the PARAM_EXTERN issue is fixed, I have rebased this patch.
> This patch had been switched to Ready For Committer in last CF, then
> Robert had comments which I have addressed, so I think the status
> should be switched back to Ready For committer.
>

As mentioned, changed the status of the patch in CF app.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Robert Haas
Date:
On Tue, Nov 7, 2017 at 4:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> As mentioned, changed the status of the patch in CF app.

I spent some time reviewing this patch today and found myself still
quite uncomfortable with the fact that it was adding execution-time
work to track the types of parameters - types that would usually not
even be used.  I found the changes to nodeNestLoop.c to be
particularly objectionable, because we could end up doing the work
over and over when it is actually not needed at all, or at most once.
I decided to try instead teaching the planner to keep track of the
types of PARAM_EXEC parameters as they were created, and that seems to
work fine.  See 0001, attached.

0002, attached, is my worked-over version of the rest of the patch.  I
moved the code that serializes and deserializes PARAM_EXEC from
nodeSubplan.c -- which seemed like a strange choice - to
execParallel.c.  I removed the type OID from the serialization format
because there's no reason to do that any more; the worker already
knows the types from the plan.  I did some renaming of the functions
involved and some adjustment of the comments to refer to "PARAM_EXEC
parameters" instead of initPlan parameters, because there's no reason
that I can see why this can only work for initPlans.  A Gather node on
the inner side of a nested loop doesn't sound like a great idea, but I
think this infrastructure could handle it (though it would need some
more planner work).  I broke a lot of long lines in your version of
the patch into multiple lines; please try to be attentive to this
issue when writing patches in general, as it is a bit tedious to go
through and insert line breaks in many places.

Please let me know your thoughts on the attached patches.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I decided to try instead teaching the planner to keep track of the
> types of PARAM_EXEC parameters as they were created, and that seems to
> work fine.  See 0001, attached.

I did not look at the other part, but 0001 looks reasonable to me.
I might quibble with the grammar in the generate_new_param comment:

- * need to record the PARAM_EXEC slot number as being allocated.
+ * need to make sure we have record the type in paramExecTypes (otherwise,
+ * there won't be a slot allocated for it). */

I'd just go with "need to record the type in ..."

Also, I wonder whether the InvalidOid hack in SS_assign_special_param
requires commentary.  It might be safer to use a valid type OID there,
perhaps VOIDOID or INTERNALOID.
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Robert Haas
Date:
On Fri, Nov 10, 2017 at 2:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I decided to try instead teaching the planner to keep track of the
>> types of PARAM_EXEC parameters as they were created, and that seems to
>> work fine.  See 0001, attached.
>
> I did not look at the other part, but 0001 looks reasonable to me.

Thanks for looking.

> I might quibble with the grammar in the generate_new_param comment:
>
> - * need to record the PARAM_EXEC slot number as being allocated.
> + * need to make sure we have record the type in paramExecTypes (otherwise,
> + * there won't be a slot allocated for it).
>   */
>
> I'd just go with "need to record the type in ..."

Noted.

> Also, I wonder whether the InvalidOid hack in SS_assign_special_param
> requires commentary.  It might be safer to use a valid type OID there,
> perhaps VOIDOID or INTERNALOID.

I think it's pretty straightforward -- if, as the existing comments
say, no Param node will be present and no value will be stored, then
we do not and cannot record the type of the value that we're not
storing.

There is existing precedent for using InvalidOid to denote the absence
of a parameter -- cf. copyParamList, SerializeParamList.  That
convention was not invented by me or the parallel query stuff,
although it has become more widespread for that reason.  I am
disinclined to have this patch invent a New Way To Do It.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Nov 10, 2017 at 2:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Also, I wonder whether the InvalidOid hack in SS_assign_special_param
>> requires commentary.  It might be safer to use a valid type OID there,
>> perhaps VOIDOID or INTERNALOID.

> There is existing precedent for using InvalidOid to denote the absence
> of a parameter -- cf. copyParamList, SerializeParamList.

OK, fair enough.  I was concerned that this was adding a corner case
not otherwise seen with Params, but evidently not.
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Sat, Nov 11, 2017 at 12:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 7, 2017 at 4:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> As mentioned, changed the status of the patch in CF app.
>
> I spent some time reviewing this patch today and found myself still
> quite uncomfortable with the fact that it was adding execution-time
> work to track the types of parameters - types that would usually not
> even be used.  I found the changes to nodeNestLoop.c to be
> particularly objectionable, because we could end up doing the work
> over and over when it is actually not needed at all, or at most once.
>

That's right, but we are just accessing tuple descriptor to get the
type, there shouldn't be much work involved in that.  However, I think
your approach has a merit that we don't need to even do that during
execution time.

> I decided to try instead teaching the planner to keep track of the
> types of PARAM_EXEC parameters as they were created, and that seems to
> work fine.  See 0001, attached.
>

This looks good to me.

> 0002, attached, is my worked-over version of the rest of the patch.  I
> moved the code that serializes and deserializes PARAM_EXEC from
> nodeSubplan.c -- which seemed like a strange choice - to
> execParallel.c.
>

I have tried to follow the practice we have used for param extern
params (SerializeParamList is in params.c) and most of the handling of
initplans is done in nodeSubplan.c, so I choose to keep the newly
added functions there.  However, I think keeping it in execParallel.c
is also okay as we do it for serialize plan.

>  I removed the type OID from the serialization format
> because there's no reason to do that any more; the worker already
> knows the types from the plan.  I did some renaming of the functions
> involved and some adjustment of the comments to refer to "PARAM_EXEC
> parameters" instead of initPlan parameters, because there's no reason
> that I can see why this can only work for initPlans.  A Gather node on
> the inner side of a nested loop doesn't sound like a great idea, but I
> think this infrastructure could handle it (though it would need some
> more planner work).
>

I think it would need some work in execution as well because the size
won't be fixed in that case for varchar type of params.  We might end
up with something different as well.

>  I broke a lot of long lines in your version of
> the patch into multiple lines; please try to be attentive to this
> issue when writing patches in general, as it is a bit tedious to go
> through and insert line breaks in many places.
>

Okay, but I sometimes rely on pgindent for such things as for few
things it becomes difficult to decide which way it will be better.

> Please let me know your thoughts on the attached patches.
>

Few minor comments:
1.
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -23,6 +23,7 @@

 #include "postgres.h"

+#include "executor/execExpr.h"
 #include "executor/execParallel.h"
 #include "executor/executor.h"
 #include "executor/nodeBitmapHeapscan.h"
@@ -31,6 +32,7 @@
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeSeqscan.h"
+#include "executor/nodeSubplan.h"

This is not required if we move serialize and other functions to execParallel.c

2.
+set_param_references(PlannerInfo *root, Plan *plan)
+{
+ Assert(IsA(plan, Gather) ||IsA(plan, GatherMerge));

I think there should be a space after || operator.

3.
+/*
+ * Serialize ParamExecData params corresponding to initplans.
+ *
..
+/*
+ * Restore ParamExecData params corresponding to initplans.
+ */

Shouldn't we change the reference to initplans here as well?

I have fixed the first two in attached patch and left the last one as
I was not sure what you have in mind

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Robert Haas
Date:
On Sat, Nov 11, 2017 at 7:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have tried to follow the practice we have used for param extern
> params (SerializeParamList is in params.c) and most of the handling of
> initplans is done in nodeSubplan.c, so I choose to keep the newly
> added functions there.  However, I think keeping it in execParallel.c
> is also okay as we do it for serialize plan.

To me it feels like there is only a very loose coupling between this
code and what's currently in nodeSubplan.c, but maybe I'm wrong.  I am
also not sure execParallel.c is the perfect place either; we might
eventually split execParallel.c into multiple files if it accumulates
too many things that are too different from each other.

> I think it would need some work in execution as well because the size
> won't be fixed in that case for varchar type of params.  We might end
> up with something different as well.

Hmm, true.  But don't we also have that problem with the patch as
written?  I mean, didn't we determine at some point that the value of
an InitPlan can change, if the initplan depends on a parameter that is
correlated but not directly correlated?  The existence of
ExecReScanSetParamPlan() seems to suggest that this is indeed
possible, and that means that the parameter could be reevaluated and
evaluate, on the second time through, to a value that requires more
storage than before.  That would be a problem, because
ExecParallelReInitializeDSM reuses the old DSM.

>>  I broke a lot of long lines in your version of
>> the patch into multiple lines; please try to be attentive to this
>> issue when writing patches in general, as it is a bit tedious to go
>> through and insert line breaks in many places.
>
> Okay, but I sometimes rely on pgindent for such things as for few
> things it becomes difficult to decide which way it will be better.

pgindent doesn't in general break long lines.  There are a few cases
where it does, like reflowing comments.  But mostly you have to do
that manually.  For instance, if you write a =
verylongfunctionname(longargument(thing), stuff(thing), foobar(thing,
thing, thing)) it's going to keep all that on one line.  If you insert
line breaks between the arguments it will keep them, though.  I think
it's worth doing something like:

git diff | awk 'length($0)>81'

on your patches before you submit them.  If you see things in there
that can be reformatted to make the long lines shorter, do it.

> I have fixed the first two in attached patch and left the last one as
> I was not sure what you have in mind

Oh, yeah, that should be changed too, something like "Serialize
PARAM_EXEC parameters."  Sorry, I thought I caught all of the
references to initplans specifically, but I guess I missed a few.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Tue, Nov 14, 2017 at 2:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Nov 11, 2017 at 7:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I have tried to follow the practice we have used for param extern
>> params (SerializeParamList is in params.c) and most of the handling of
>> initplans is done in nodeSubplan.c, so I choose to keep the newly
>> added functions there.  However, I think keeping it in execParallel.c
>> is also okay as we do it for serialize plan.
>
> To me it feels like there is only a very loose coupling between this
> code and what's currently in nodeSubplan.c, but maybe I'm wrong.  I am
> also not sure execParallel.c is the perfect place either; we might
> eventually split execParallel.c into multiple files if it accumulates
> too many things that are too different from each other.
>

Another possibility is subselect.c, because it contains initplan
stuff, however, most of the things are related to planning, so not
sure.  I think nodeSubplan.c won't be a bad choice as it is mentioned
in the file header that it is concerned about initplan stuff.

>> I think it would need some work in execution as well because the size
>> won't be fixed in that case for varchar type of params.  We might end
>> up with something different as well.
>
> Hmm, true.  But don't we also have that problem with the patch as
> written?  I mean, didn't we determine at some point that the value of
> an InitPlan can change, if the initplan depends on a parameter that is
> correlated but not directly correlated?  The existence of
> ExecReScanSetParamPlan() seems to suggest that this is indeed
> possible, and that means that the parameter could be reevaluated and
> evaluate, on the second time through, to a value that requires more
> storage than before.  That would be a problem, because
> ExecParallelReInitializeDSM reuses the old DSM.
>

I think this would have been a matter of concern if we use initplans
below Gather/Gather Merge.  The patch uses initplans which are between
current query level and root.  So, I think we don't need to reevaluate
such parameters.  Let us try to see it via example:

                                      QUERY PLAN
----------------------------------------------------------------------------------------Aggregate  (cost=62.08..62.08
rows=1width=8)  InitPlan 1 (returns $1)    ->  Finalize Aggregate  (cost=20.64..20.65 rows=1 width=8)          ->
Gather (cost=20.63..20.64 rows=2 width=8)                Workers Planned: 2                ->  Partial Aggregate
(cost=20.63..20.64rows=1 width=8)                      ->  Parallel Seq Scan on t3  (cost=0.00..18.50
 
rows=850 width=4)  ->  Hash Join  (cost=20.75..41.42 rows=1 width=4)        Hash Cond: (t1.j = t2.j)        ->  Gather
(cost=0.00..20.63rows=10 width=12)              Workers Planned: 2              Params Evaluated: $1              ->
ParallelSeq Scan on t1  (cost=0.00..20.63 rows=4 width=12)                    Filter: (k = $1)        ->  Hash
(cost=20.63..20.63rows=10 width=8)              ->  Gather  (cost=0.00..20.63 rows=10 width=8)
WorkersPlanned: 2                    Params Evaluated: $1                    ->  Parallel Seq Scan on t2
(cost=0.00..20.63
rows=4 width=8)                          Filter: (k = $1)
(20 rows)


Now, here even if initplan would have been an undirect correlated
plan, it wouldn't have been a problem, because we don't need to
reevaluate it for Gather node below Hash.

Am I missing something?  Do you have some test or shape of the plan in
mind which can cause a problem?

>>>  I broke a lot of long lines in your version of
>>> the patch into multiple lines; please try to be attentive to this
>>> issue when writing patches in general, as it is a bit tedious to go
>>> through and insert line breaks in many places.
>>
>> Okay, but I sometimes rely on pgindent for such things as for few
>> things it becomes difficult to decide which way it will be better.
>
> pgindent doesn't in general break long lines.  There are a few cases
> where it does, like reflowing comments.  But mostly you have to do
> that manually.  For instance, if you write a =
> verylongfunctionname(longargument(thing), stuff(thing), foobar(thing,
> thing, thing)) it's going to keep all that on one line.  If you insert
> line breaks between the arguments it will keep them, though.  I think
> it's worth doing something like:
>
> git diff | awk 'length($0)>81'
>
> on your patches before you submit them.  If you see things in there
> that can be reformatted to make the long lines shorter, do it.
>

Okay, point noted.

>> I have fixed the first two in attached patch and left the last one as
>> I was not sure what you have in mind
>
> Oh, yeah, that should be changed too, something like "Serialize
> PARAM_EXEC parameters."  Sorry, I thought I caught all of the
> references to initplans specifically, but I guess I missed a few.
>

No issues, I can go through it once more to change the comments
wherever initplans is used after we settle on the above discussion.
If we decide that initplan and other PARAM_EXEC params need different
treatment, then we might want to retain initplans in the comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] parallelize queries containing initplans

From
Robert Haas
Date:
On Tue, Nov 14, 2017 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think this would have been a matter of concern if we use initplans
> below Gather/Gather Merge.  The patch uses initplans which are between
> current query level and root.  So, I think we don't need to reevaluate
> such parameters.  Let us try to see it via example:
>
>
>                                        QUERY PLAN
> ----------------------------------------------------------------------------------------
>  Aggregate  (cost=62.08..62.08 rows=1 width=8)
>    InitPlan 1 (returns $1)
>      ->  Finalize Aggregate  (cost=20.64..20.65 rows=1 width=8)
>            ->  Gather  (cost=20.63..20.64 rows=2 width=8)
>                  Workers Planned: 2
>                  ->  Partial Aggregate  (cost=20.63..20.64 rows=1 width=8)
>                        ->  Parallel Seq Scan on t3  (cost=0.00..18.50
> rows=850 width=4)
>    ->  Hash Join  (cost=20.75..41.42 rows=1 width=4)
>          Hash Cond: (t1.j = t2.j)
>          ->  Gather  (cost=0.00..20.63 rows=10 width=12)
>                Workers Planned: 2
>                Params Evaluated: $1
>                ->  Parallel Seq Scan on t1  (cost=0.00..20.63 rows=4 width=12)
>                      Filter: (k = $1)
>          ->  Hash  (cost=20.63..20.63 rows=10 width=8)
>                ->  Gather  (cost=0.00..20.63 rows=10 width=8)
>                      Workers Planned: 2
>                      Params Evaluated: $1
>                      ->  Parallel Seq Scan on t2  (cost=0.00..20.63
> rows=4 width=8)
>                            Filter: (k = $1)
> (20 rows)
>
>
> Now, here even if initplan would have been an undirect correlated
> plan, it wouldn't have been a problem, because we don't need to
> reevaluate it for Gather node below Hash.
>
> Am I missing something?  Do you have some test or shape of the plan in
> mind which can cause a problem?

The problem would happen if the plan for InitPlan $1 in the above
example itself referenced a parameter from an upper query level, and
the value of that parameter changed, and then this section of the plan
tree was rescanned.  I'm not sure I can write a query like that
off-hand, but I think it's possible to do so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] parallelize queries containing initplans

From
Robert Haas
Date:
On Tue, Nov 14, 2017 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The problem would happen if the plan for InitPlan $1 in the above
> example itself referenced a parameter from an upper query level, and
> the value of that parameter changed, and then this section of the plan
> tree was rescanned.  I'm not sure I can write a query like that
> off-hand, but I think it's possible to do so.

OK, here's an example:

regression=# explain select * from tenk1 a where unique1 = (select
unique1 from tenk1 b where (select string4 from tenk1 c where c.ten =
a.ten order by unique1 limit 1) < b.string4 limit 1);                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------Seq Scan on
tenk1a  (cost=0.00..22051.31 rows=50 width=244)  Filter: (unique1 = (SubPlan 2))  SubPlan 2    ->  Limit
(cost=2.01..2.16rows=1 width=4)          InitPlan 1 (returns $1)            ->  Limit  (cost=0.29..2.01 rows=1
width=68)                 ->  Index Scan using tenk1_unique1 on tenk1 c
 
(cost=0.29..1727.20 rows=1000 width=68)                        Filter: (ten = a.ten)          ->  Seq Scan on tenk1 b
(cost=0.00..483.00rows=3333 width=4)                Filter: ($1 < string4)
 
(10 rows)

InitPlan 1 has got to be re-evaluated for every row in the "Seq Scan
on tenk1 a", and each iteration could return a different value for $1,
and some of those values might be wider than others -- well, not
really, because in this example string4 is actually declared as type
"name".  But if you imagine it as type "text" then you can see the
problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] parallelize queries containing initplans

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Nov 14, 2017 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Am I missing something?  Do you have some test or shape of the plan in
>> mind which can cause a problem?

> The problem would happen if the plan for InitPlan $1 in the above
> example itself referenced a parameter from an upper query level, and
> the value of that parameter changed, and then this section of the plan
> tree was rescanned.  I'm not sure I can write a query like that
> off-hand, but I think it's possible to do so.

Yeah, I'm sure it is.  I have a vague recollection that there might be
existing regression test cases exercising such things, though I won't
swear to that.  The "orderstest" bit in subselect.sql looks like it
might be meant to do that...
        regards, tom lane


Re: [HACKERS] parallelize queries containing initplans

From
Robert Haas
Date:
On Tue, Nov 14, 2017 at 11:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, I'm sure it is.  I have a vague recollection that there might be
> existing regression test cases exercising such things, though I won't
> swear to that.  The "orderstest" bit in subselect.sql looks like it
> might be meant to do that...

Here's an updated patch that attempts to work around this problem using DSA.

It could use a test case that actually tickles the new logic in
ExecParallelReinitialize ... this is mostly just to show the concept.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Wed, Nov 15, 2017 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 14, 2017 at 11:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, I'm sure it is.  I have a vague recollection that there might be
>> existing regression test cases exercising such things, though I won't
>> swear to that.  The "orderstest" bit in subselect.sql looks like it
>> might be meant to do that...
>

I agree that such cases can cause a problem with fixed memory.

> Here's an updated patch that attempts to work around this problem using DSA.
>

There were a couple of problems with your changes:
1.
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
+ dsa_handle param_exec; /* serialized PARAM_EXEC parameters */
@@ -35,12 +36,13 @@ typedef struct ParallelExecutorInfo
 } ParallelExecutorInfo;

This should be dsa_pointer, otherwise, the value returned by
SerializeParamExecParams will get truncated.

2. In ExecParallelReinitialize(), we need to evaluate the params
before serializing them.

3. I think we should free the dsa pointer at the end of the gather.

Attached patch fixes the mentioned problems.

> It could use a test case that actually tickles the new logic in
> ExecParallelReinitialize ... this is mostly just to show the concept.
>

Thanks, it was quite helpful.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [HACKERS] parallelize queries containing initplans

From
Kuntal Ghosh
Date:
On Thu, Nov 16, 2017 at 3:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Nov 15, 2017 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Nov 14, 2017 at 11:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Yeah, I'm sure it is.  I have a vague recollection that there might be
>>> existing regression test cases exercising such things, though I won't
>>> swear to that.  The "orderstest" bit in subselect.sql looks like it
>>> might be meant to do that...
>>
>
> I agree that such cases can cause a problem with fixed memory.
>
I'm able to come up with a test case to produce the problem.

regression[107494]=# select ten,count(*) from tenk1 a where a.ten in (select
b.ten from tenk1 b where (select a.ten from tenk1 c where c.ten =
a.ten limit 1) = b.ten limit 1) group by a.ten;
                   QUERY PLAN
---------------------------------------------------
HashAggregate Group Key: a.ten ->  Seq Scan on tenk1 a       Filter: (SubPlan 2)       SubPlan 2         ->  Limit
        InitPlan 1 (returns $1)                 ->  Limit                       ->  Seq Scan on tenk1 c
           Filter: (ten = a.ten)               ->  Seq Scan on tenk1 b                     Filter: ($1 = ten)
 
In the above case, Subplan 2 returns the same value of a.ten that is
used in initplan as a reference. So, this query is nothing but select
ten,count(*) from tenk1 group by a.ten. The output should be 10 rows
each row having a count=1000.

By enforcing parallelism, I got the following plan:                     QUERY PLAN
------------------------------------------------------
HashAggregate Group Key: a.ten ->  Seq Scan on tenk1 a       Filter: (SubPlan 2)       SubPlan 2         ->  Limit
        InitPlan 1 (returns $1)                 ->  Limit                       ->  Seq Scan on tenk1 c
           Filter: (ten = a.ten)               ->  Gather                     Workers Planned: 2
ParamsEvaluated: $1                     ->  Parallel Seq Scan on tenk1 b                           Filter: ($1 = ten)
 

When I set parallel_leader_participation to off, I get the incorrect
result as follows:ten | count
------------  0  | 1000
(1 row)
In the above case, the initplan gets evaluated only once from
ExecInitParallelPlan path. Hence, we see the incorrect result.

>> Here's an updated patch that attempts to work around this problem using DSA.
>>
>
> There were a couple of problems with your changes:
> 1.
> BufferUsage *buffer_usage; /* points to bufusage area in DSM */
> + dsa_handle param_exec; /* serialized PARAM_EXEC parameters */
> @@ -35,12 +36,13 @@ typedef struct ParallelExecutorInfo
>  } ParallelExecutorInfo;
>
> This should be dsa_pointer, otherwise, the value returned by
> SerializeParamExecParams will get truncated.
>
> 2. In ExecParallelReinitialize(), we need to evaluate the params
> before serializing them.
>
> 3. I think we should free the dsa pointer at the end of the gather.
>
> Attached patch fixes the mentioned problems.
>
>> It could use a test case that actually tickles the new logic in
>> ExecParallelReinitialize ... this is mostly just to show the concept.
>>
>
> Thanks, it was quite helpful.
>
I've tested the above-mentioned scenario with this patch and it is
working fine. Also, I've created a text column named 'vartext',
inserted some random length texts(max length 100) and tweaked the
above query as follows:
select ten,count(*) from tenk1 a where a.ten in (select
b.ten from tenk1 b where (select a.vartext from tenk1 c where c.ten =
a.ten limit 1) = b.vartext limit 1) group by a.ten;
This query is equivalent to select ten,count(*) from tenk1 group by
a.ten. It also produced the expected result without throwing any
error.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] parallelize queries containing initplans

From
Robert Haas
Date:
On Thu, Nov 16, 2017 at 5:23 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> I've tested the above-mentioned scenario with this patch and it is
> working fine. Also, I've created a text column named 'vartext',
> inserted some random length texts(max length 100) and tweaked the
> above query as follows:
> select ten,count(*) from tenk1 a where a.ten in (select
> b.ten from tenk1 b where (select a.vartext from tenk1 c where c.ten =
> a.ten limit 1) = b.vartext limit 1) group by a.ten;
> This query is equivalent to select ten,count(*) from tenk1 group by
> a.ten. It also produced the expected result without throwing any
> error.

Great!  I have committed the patch; thanks for testing.

As I said in the commit message, there's a lot more work that could be
done here.  I think we should consider trying to revise this whole
system so that instead of serializing the values and passing them to
the workers, we allocate an array of slots where each slot has a Datum
flag, an isnull flag, and a dsa_pointer (which maybe could be union'd
to the Datum?).  If we're passing a parameter by value, we could just
store it in the Datum field; if it's null, we just set isnull.  If
it's being passed by reference, we dsa_allocate() space for it, copy
it into that space, and then store the dsa_pointer.

The advantage of this is that it would be possible to imagine the
contents of a slot changing while parallelism is running, which
doesn't really work with the current serialized-blob representation.
That would in turn allow us to imagine letting parallel-safe InitPlans
being evaluated by the first participant that needs the value rather
than before launching workers, which would be good, not only because
of the possibility of deferring work for InitPlans attached at or
above the Gather but also because it could be used for InitPlans below
the Gather (as long as they don't depend on any parameters computed
below the Gather).  I thought about trying to refactor this patch to
use a slot concept like what I just mentioned even before committing
it, but after poking at it a bit I decided that was going to require
too much new development to justify cramming it into the same patch.
Still, it seems like a good idea to do it at some point (when we have
lots of free time?).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] parallelize queries containing initplans

From
Amit Kapila
Date:
On Thu, Nov 16, 2017 at 10:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 16, 2017 at 5:23 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> I've tested the above-mentioned scenario with this patch and it is
>> working fine. Also, I've created a text column named 'vartext',
>> inserted some random length texts(max length 100) and tweaked the
>> above query as follows:
>> select ten,count(*) from tenk1 a where a.ten in (select
>> b.ten from tenk1 b where (select a.vartext from tenk1 c where c.ten =
>> a.ten limit 1) = b.vartext limit 1) group by a.ten;
>> This query is equivalent to select ten,count(*) from tenk1 group by
>> a.ten. It also produced the expected result without throwing any
>> error.
>
> Great!  I have committed the patch; thanks for testing.
>

Thanks.

> As I said in the commit message, there's a lot more work that could be
> done here.  I think we should consider trying to revise this whole
> system so that instead of serializing the values and passing them to
> the workers, we allocate an array of slots where each slot has a Datum
> flag, an isnull flag, and a dsa_pointer (which maybe could be union'd
> to the Datum?).  If we're passing a parameter by value, we could just
> store it in the Datum field; if it's null, we just set isnull.  If
> it's being passed by reference, we dsa_allocate() space for it, copy
> it into that space, and then store the dsa_pointer.
>
> The advantage of this is that it would be possible to imagine the
> contents of a slot changing while parallelism is running, which
> doesn't really work with the current serialized-blob representation.
> That would in turn allow us to imagine letting parallel-safe InitPlans
> being evaluated by the first participant that needs the value rather
> than before launching workers, which would be good, not only because
> of the possibility of deferring work for InitPlans attached at or
> above the Gather but also because it could be used for InitPlans below
> the Gather (as long as they don't depend on any parameters computed
> below the Gather).
>

That would be cool, but I think here finding whether it is dependent
on any parameter computed below gather could be tricky.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com