[HACKERS] parallelize queries containing subplans - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | [HACKERS] parallelize queries containing subplans |
Date | |
Msg-id | CAA4eK1+e8Z45D2n+rnDMDYsVEb5iW7jqaCH_tvPMYau=1Rru9w@mail.gmail.com Whole thread Raw |
Responses |
Re: [HACKERS] parallelize queries containing subplans
Re: [HACKERS] parallelize queries containing subplans Re: [HACKERS] parallelize queries containing subplans Re: [HACKERS] parallelize queries containing subplans |
List | pgsql-hackers |
Currently, queries that have references to SubPlans or AlternativeSubPlans are considered parallel-restricted. I think we can lift this restriction in many cases especially when SubPlans are parallel-safe. To make this work, we need to propagate the parallel-safety information from path node to plan node and the same could be easily done while creating a plan. Another option could be that instead of propagating parallel-safety information from path to plan, we can find out from the plan if it is parallel-safe (doesn't contain any parallel-aware node) by traversing whole plan tree, but I think it is a waste of cycles. Once we have parallel-safety information in the plan, we can use that for detection of parallel-safe expressions in max_parallel_hazard_walker(). Finally, we can pass all the subplans to workers during plan serialization in ExecSerializePlan(). This will enable workers to execute subplans that are referred in parallel part of the plan. Now, we might be able to optimize it such that we pass only subplans that are referred in parallel portion of plan, but I am not sure if it is worth the trouble because it is one-time cost and much lesser than other things we do (like creating dsm, launching workers). Attached patch implements the above idea. This will enable parallelism for queries containing un-correlated subplans, an example of which is as follows: set parallel_tuple_cost=0; set parallel_setup_cost=0; set min_parallel_relation_size=50; create table t1 (i int, j int, k int); create table t2 (i int, j int, k int); insert into t1 values (generate_series(1,10)*random(), generate_series(5,50)*random(), generate_series(8,80)*random()); insert into t2 values (generate_series(4,10)*random(), generate_series(5,90)*random(), generate_series(2,10)*random()); Plan without Patch ----------------------------- postgres=# explain select * from t1 where t1.i not in (select t2.i from t2 where t2.i in (1,2,3)); QUERY PLAN --------------------------------------------------------------- Seq Scan on t1 (cost=110.84..411.72 rows=8395 width=12) Filter: (NOT (hashed SubPlan 1)) SubPlan 1 -> Seq Scan on t2 (cost=0.00..104.50 rows=2537 width=4) Filter: (i = ANY ('{1,2,3}'::integer[])) (5 rows) Plan with Patch ------------------------ postgres=# explain select * from t1 where t1.i not in (select t2.i from t2 where t2.i in (1,2,3)); QUERY PLAN ------------------------------------------------------------------------- Gather (cost=110.84..325.30 rows=8395 width=12) Workers Planned: 1 -> Parallel Seq Scan on t1 (cost=110.84..325.30 rows=4938 width=12) Filter: (NOT (hashed SubPlan 1)) SubPlan 1 -> Seq Scan on t2 (cost=0.00..104.50 rows=2537 width=4) Filter: (i = ANY ('{1,2,3}'::integer[])) (7 rows) We have observed that Q-16 in TPC-H have been improved with the patch and the analysis of same will be shared by my colleague Rafia. Now, we can further extend this to parallelize queries containing correlated subplans like below: explain select * from t1 where t1.i in (select t2.i from t2 where t2.i=t1.i); QUERY PLAN ------------------------------------------------------------- Seq Scan on t1 (cost=0.00..831049.09 rows=8395 width=12) Filter: (SubPlan 1) SubPlan 1 -> Seq Scan on t2 (cost=0.00..97.73 rows=493 width=4) Filter: (i = t1.i) (5 rows) In the above query, Filter on t2 (i = t1.i) generates Param node which is a parallel-restricted node, so such queries won't be able to use parallelism even with the patch. I think we can mark Params which refer to same level as parallel-safe and I think we have this information (node-> varlevelsup/ phlevelsup/ agglevelsup) available when we replace correlation vars (SS_replace_correlation_vars). The reason why it is not advisable to mark Params that don't refer to same query level as parallel-safe is that can lead to plans like below: Foo -> Gather -> Bar SubPlan 1 (SubPlan refers to Foo) Now, this problem is tricky because we need to pass all such Params each time we invoke Gather. That also is doable with much more effort, but I am not sure if it is worth because all of the use cases I have seen in TPC-H (Q-2) or TPC-DS (Q-6) always uses SubPlans that refer to same query level. Yet, another useful enhancement in this area could be to consider both parallel and non-parallel paths for subplans. As of now, we consider the cheapest/best path and form subplan from it, but it is quite possible that instead of choosing parallel path (in case it is cheapest) at subplan level, the non-parallel path at subplan level could be beneficial when upper plan can use parallelism. I think this will be a separate project in itself if we want to do this and based on my study of TPC-H and TPC-DS queries, I am confident that this will be helpful in certain queries at higher scale factors. Thoughts? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
pgsql-hackers by date: