Re: [HACKERS] parallelize queries containing initplans - Mailing list pgsql-hackers

From Kuntal Ghosh
Subject Re: [HACKERS] parallelize queries containing initplans
Msg-id CAGz5QC+uHOq78GCika3fbgRyN5zgiDR9Dd1Th5kENF+UpnPomQ@mail.gmail.com
In response to Re: [HACKERS] parallelize queries containing initplans  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: parallelize queries containing initplans  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Tue, Mar 14, 2017 at 3:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Based on that idea, I have modified the patch such that it will
> compute the set of initplans Params that are required below gather
> node and store them as bitmap of initplan params at gather node.
> During set_plan_references, we can find the intersection of external
> parameters that are required at Gather or nodes below it with the
> initplans that are passed from same or above query level. Once the set
> of initplan params are established, we evaluate those (if they are not
> already evaluated) before execution of gather node and then pass the
> computed value to each of the workers.   To identify whether a
> particular param is parallel safe or not, we check if the paramid of
> the param exists in initplans at same or above query level.  We don't
> allow to generate gather path if there are initplans at some query
> level below the current query level as those plans could be
> parallel-unsafe or undirect correlated plans.

I would like to describe the test scenarios with InitPlans that we
considered while developing and testing the patch.

An InitPlan is a subselect that doesn't reference its immediate outer
query level and returns a param value. For example, consider a query
that produces the following plan:

          QUERY PLAN
------------------------------
 Seq Scan on t1
   Filter: (k = $0)
   allParams: $0
   InitPlan 1 (returns $0)
     ->  Aggregate
           ->  Seq Scan on t3

In this case, the InitPlan is evaluated once, when the filter is
checked for the first time. Subsequent checks reuse the value, so the
InitPlan need not be evaluated again. In our approach, we parallelize
the sequential scan by inserting a Gather node on top of a Parallel
Seq Scan node. At the Gather node, we evaluate the InitPlan before
launching the workers and pass its value to them through dynamic
shared memory. This yields the following plan:

                    QUERY PLAN
---------------------------------------------------
 Gather
   Workers Planned: 2
   Params Evaluated: $0
   InitPlan 1 (returns $0)
     ->  Aggregate
           ->  Seq Scan on t3
   ->  Parallel Seq Scan on t1
         Filter: (k = $0)
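
As a rough illustration of the evaluation order in the plan above,
here is a toy Python model. It is not PostgreSQL code: every name in
it is hypothetical, and plain function calls stand in for workers and
dynamic shared memory.

```python
# Toy model only: the names below are hypothetical and do not correspond
# to PostgreSQL functions or data structures.

class InitPlan:
    """A subselect whose result is computed at most once and then cached."""

    def __init__(self, compute):
        self._compute = compute
        self._value = None
        self._done = False
        self.evaluations = 0  # how many times the subselect actually ran

    def value(self):
        if not self._done:
            self._value = self._compute()
            self._done = True
            self.evaluations += 1
        return self._value


def worker_scan(rows, param_value):
    """Worker side: filter its share of t1 using the value it was handed."""
    return [k for k in rows if k == param_value]


def gather(t1, initplan, nworkers=2):
    """Leader side: evaluate the param first, then pass it to every worker."""
    param_value = initplan.value()          # corresponds to "Params Evaluated: $0"
    chunks = [t1[i::nworkers] for i in range(nworkers)]
    result = []
    for chunk in chunks:                    # workers never run the InitPlan
        result.extend(worker_scan(chunk, param_value))
    return result


t3 = [1, 5, 3]
t1 = [5, 2, 5, 7, 1]
plan = InitPlan(lambda: max(t3))            # stands in for the Aggregate on t3
matches = gather(t1, plan)
```

The point of the sketch is only that the subselect runs once in the
leader, no matter how many workers consume its value.
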
As Amit mentioned upthread, at a Gather node we evaluate only those
InitPlans that are attached to this query level or a higher one and
are used under the Gather node. The extParam of a node includes the
InitPlan params that must be passed in from an outer node. I've
attached a patch that displays extParam and allParam for each node.
Here is the output with that patch:

                    QUERY PLAN
---------------------------------------------------
 Gather
   Workers Planned: 2
   Params Evaluated: $0
   allParams: $0
   InitPlan 1 (returns $0)
     ->  Finalize Aggregate
           ->  Gather
                 Workers Planned: 2
                 ->  Partial Aggregate
                       ->  Parallel Seq Scan on t3
   ->  Parallel Seq Scan on t1
         Filter: (k = $0)
         allParams: $0
         extParams: $0

In this case, $0 is included in the extParam of the Parallel Seq Scan,
and the InitPlan corresponding to this param is attached to the same
query level that contains the Gather node. Hence, we evaluate $0 at
the Gather and pass it to the workers.
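
The selection rule described above (external params needed at or
below the Gather, intersected with InitPlan params from the same or an
outer query level) can be sketched with Python sets. This is only an
illustration: the function name, the dictionary layout, and the level
numbering (1 = outermost query) are assumptions, not the patch's data
structures.

```python
# Illustrative only: plain Python sets stand in for param-ID bitmaps, and
# the level numbering (1 = outermost query) is an assumption of this sketch.

def params_to_evaluate_at_gather(subtree_ext_params, initplan_params_by_level,
                                 gather_level):
    """Return the InitPlan params a Gather must compute before launching workers.

    subtree_ext_params: param IDs required from outside by the Gather
        or by nodes below it
    initplan_params_by_level: {query_level: set of param IDs produced by
        InitPlans attached at that level}
    gather_level: the query level that contains the Gather
    """
    available = set()
    for level, params in initplan_params_by_level.items():
        if level <= gather_level:           # same or outer query level
            available |= params
    return subtree_ext_params & available


# The plan above: $0 is needed below the Gather and produced at the same level.
same_level = params_to_evaluate_at_gather({0}, {1: {0}}, gather_level=1)

# Param 3 is not produced by any InitPlan, and $1 belongs to a deeper level.
mixed = params_to_evaluate_at_gather({0, 1, 3}, {1: {0}, 2: {1}}, gather_level=1)
```
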

However, generating a plan like this requires marking an InitPlan
param as parallel_safe. We can't mark all params as parallel_safe
because of correlated subselects. Hence, in
max_parallel_hazard_walker, the only params marked safe are InitPlan
params from the current or an outer query level. An InitPlan param
from an inner query level isn't marked safe, since we can't evaluate
it at any Gather node above the node where the param is used. As
mentioned by Amit, we also don't allow generation of a Gather path if
there are InitPlans at some query level below the current query level,
as those plans could be parallel-unsafe or undirect correlated plans.
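
The two rules in the paragraph above can be modelled as follows. This
is a simplified sketch, not the logic of max_parallel_hazard_walker
itself: the helper names and the level numbering (1 = outermost query)
are hypothetical.

```python
# Simplified model; helper names and level numbering (1 = outermost query)
# are hypothetical and chosen only for this illustration.

def safe_initplan_params(current_level, initplan_params_by_level):
    """InitPlan params treated as parallel safe: current or outer levels only."""
    safe = set()
    for level, params in initplan_params_by_level.items():
        if level <= current_level:
            safe |= params
    return safe


def gather_path_allowed(current_level, initplan_params_by_level):
    """Refuse a Gather path if any deeper query level has InitPlans."""
    return not any(
        level > current_level and params
        for level, params in initplan_params_by_level.items()
    )


levels = {1: {0}, 2: {1}}                   # $0 at the outer level, $1 one level down

safe_at_outer = safe_initplan_params(1, levels)
safe_at_inner = safe_initplan_params(2, levels)
outer_ok = gather_path_allowed(1, levels)   # an InitPlan exists below level 1
inner_ok = gather_path_allowed(2, levels)
```

In this model, the outer level cannot get a Gather path because of the
InitPlan one level down, while the inner level can use both $0 and $1.
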

I've attached a script file and its output containing several
scenarios relevant to InitPlans. I've also attached the patch for
displaying extParam and allParam at each node. This patch can be
applied on top of pq_pushdown_initplan_v3.patch. Please find the
attachments.


> This restricts some of the cases for parallelism like when initplans
> are below gather node, but the patch looks better. We can open up
> those cases if required in a separate patch.

+1. Unfortunately, this patch doesn't enable parallelism for all
possible cases with InitPlans. Our objective is to keep things simple
and clean. Still, TPC-H q22 runs 2.5 to 3 times faster with this patch.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

