Thread: [HACKERS] Parallel safety for extern params
As discussed in a nearby thread [1] (parallelize queries containing initplans), it appears that there are cases where queries referring PARAM_EXTERN params are treated as parallel-restricted even though they should be parallel-safe. I have done some further investigation and found that actually for most of the cases they are treated as parallel-restricted (due to tests in max_parallel_hazard_walker). The cases where such queries are treated as parallel-safe are when eval_const_expressions_mutator converts the param to const. So whenever we select the generic plan, it won't work. One simple example is: create table t1(c1 int, c2 char(500)); insert into t1 values(generate_series(1,10000),'aaa'); analyze t1; set force_parallel_mode = on; regression=# prepare t1_select(int) as Select c1 from t1 where c1 < $1; PREPARE regression=# explain (costs off, analyze) execute t1_select(10000); QUERY PLAN ------------------------------------------------------------------- Gather (actual time=35.252..44.727 rows=9999 loops=1) Workers Planned: 1 Workers Launched: 1 Single Copy: true -> Seq Scan on t1 (actual time=0.364..5.638 rows=9999 loops=1) Filter: (c1 < 10000) Rows Removed by Filter: 1 Planning time: 2135.514 ms Execution time: 52.886 ms (9 rows) The next four executions will give the same plan, however, the sixth execution will give below plan: regression=# explain (costs off, analyze) execute t1_select(10005); QUERY PLAN -------------------------------------------------------------- Seq Scan on t1 (actual time=0.049..6.188 rows=10000 loops=1) Filter: (c1 < $1) Planning time: 22577.404 ms Execution time: 7.612 ms (4 rows) Attached patch fix_parallel_safety_for_extern_params_v1.patch fixes this problem. Note, for now, I have added a very simple test in regression tests to cover prepared statement case, but we might want to add something related to generic plan selection as I have shown above. I am just not sure if it is a good idea to have multiple executions just to get the generic plan. After fixing this problem, when I ran the regression tests with force_parallel_mode = regress, I saw multiple other failures. All the failures boil down to two kinds of cases: 1. There was an assumption while extracting simple expressions that the target list of gather node can contain constants or Var's. Now, once the above patch allows extern params as parallel-safe, that assumption no longer holds true. We need to handle params as well. Attached patch fix_simple_expr_interaction_gather_v1.patch handles that case. 2. We don't allow execution to use parallelism if the plan can be executed multiple times. This has been enforced in ExecutePlan, but it seems like that we miss to handle the case where we are already in parallel mode by the time we enforce that condition. So, what happens, as a result, is that it will allow to use parallelism when it shouldn't (when the same plan is executed multiple times) and lead to a crash. One way to fix is that we temporarily exit the parallel mode in such cases and reenter in the same state once the current execution is over. Attached patch fix_parallel_mode_nested_execution_v1.patch fixes this problem. Thanks to Kuntal who has helped in investigating the regression failures which happened as a result of making param extern params as parallel-safe. [1] - https://www.postgresql.org/message-id/CA%2BTgmobSN66o2_c76rY12JvS_sZa17zpKvpuyG-QzRvVLYE8Vg%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Fri, Oct 13, 2017 at 1:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > After fixing this problem, when I ran the regression tests with > force_parallel_mode = regress, I saw multiple other failures. All the > failures boil down to two kinds of cases: > > 1. There was an assumption while extracting simple expressions that > the target list of gather node can contain constants or Var's. Now, > once the above patch allows extern params as parallel-safe, that > assumption no longer holds true. We need to handle params as well. > Attached patch fix_simple_expr_interaction_gather_v1.patch handles > that case. - * referencing the child node's output ... but setrefs.c might also have - * copied a Const as-is. + * referencing the child node's output or a Param... but setrefs.c might + * also have copied a Const as-is. I think the Param case should be mentioned after "... but" not before - i.e. referencing the child node's output... but setrefs.c might also have copied a Const or Param is-is. > 2. We don't allow execution to use parallelism if the plan can be > executed multiple times. This has been enforced in ExecutePlan, but > it seems like that we miss to handle the case where we are already in > parallel mode by the time we enforce that condition. So, what > happens, as a result, is that it will allow to use parallelism when it > shouldn't (when the same plan is executed multiple times) and lead to > a crash. One way to fix is that we temporarily exit the parallel mode > in such cases and reenter in the same state once the current execution > is over. Attached patch fix_parallel_mode_nested_execution_v1.patch > fixes this problem. This seems completely unsafe. If somebody's already entered parallel mode, they are counting on having the parallel-mode restrictions enforced until they exit parallel mode. We can't just disable those restrictions for a while in the middle and then put them back. I think the bug is in ExecGather(Merge): it assumes that if we're in parallel mode, it's OK to start workers. But actually, it shouldn't do this unless ExecutePlan ended up with use_parallel_mode == true, which isn't quite the same thing. I think we might need ExecutePlan to set a flag in the estate that ExecGather(Merge) can check. Thanks for working on this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Oct 14, 2017 at 1:51 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Oct 13, 2017 at 1:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> After fixing this problem, when I ran the regression tests with >> force_parallel_mode = regress, I saw multiple other failures. All the >> failures boil down to two kinds of cases: >> >> 1. There was an assumption while extracting simple expressions that >> the target list of gather node can contain constants or Var's. Now, >> once the above patch allows extern params as parallel-safe, that >> assumption no longer holds true. We need to handle params as well. >> Attached patch fix_simple_expr_interaction_gather_v1.patch handles >> that case. > > - * referencing the child node's output ... but setrefs.c might also have > - * copied a Const as-is. > + * referencing the child node's output or a Param... but setrefs.c might > + * also have copied a Const as-is. > > I think the Param case should be mentioned after "... but" not before > - i.e. referencing the child node's output... but setrefs.c might also > have copied a Const or Param is-is. > I am not sure if we can write the comment like that (.. copied a Const or Param as-is.) because fix_upper_expr_mutator in setrefs.c has a special handling for Var and Param where constants are copied as-is via expression_tree_mutator. Also as proposed, the handling for params is more like Var in exec_save_simple_expr. >> 2. We don't allow execution to use parallelism if the plan can be >> executed multiple times. This has been enforced in ExecutePlan, but >> it seems like that we miss to handle the case where we are already in >> parallel mode by the time we enforce that condition. So, what >> happens, as a result, is that it will allow to use parallelism when it >> shouldn't (when the same plan is executed multiple times) and lead to >> a crash. One way to fix is that we temporarily exit the parallel mode >> in such cases and reenter in the same state once the current execution >> is over. Attached patch fix_parallel_mode_nested_execution_v1.patch >> fixes this problem. > > This seems completely unsafe. If somebody's already entered parallel > mode, they are counting on having the parallel-mode restrictions > enforced until they exit parallel mode. We can't just disable those > restrictions for a while in the middle and then put them back. > Right. > I think the bug is in ExecGather(Merge): it assumes that if we're in > parallel mode, it's OK to start workers. But actually, it shouldn't > do this unless ExecutePlan ended up with use_parallel_mode == true, > which isn't quite the same thing. I think we might need ExecutePlan > to set a flag in the estate that ExecGather(Merge) can check. > Attached patch fixes the problem in a suggested way. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Mon, Oct 16, 2017 at 4:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sat, Oct 14, 2017 at 1:51 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> I think the bug is in ExecGather(Merge): it assumes that if we're in >> parallel mode, it's OK to start workers. But actually, it shouldn't >> do this unless ExecutePlan ended up with use_parallel_mode == true, >> which isn't quite the same thing. I think we might need ExecutePlan >> to set a flag in the estate that ExecGather(Merge) can check. >> > > Attached patch fixes the problem in a suggested way. > > All the patches still apply and pass the regression test. I have added this to CF to avoid losing the track of this patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Oct 16, 2017 at 12:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I think the Param case should be mentioned after "... but" not before >> - i.e. referencing the child node's output... but setrefs.c might also >> have copied a Const or Param is-is. > > I am not sure if we can write the comment like that (.. copied a Const > or Param as-is.) because fix_upper_expr_mutator in setrefs.c has a > special handling for Var and Param where constants are copied as-is > via expression_tree_mutator. Also as proposed, the handling for > params is more like Var in exec_save_simple_expr. I committed fix_parallel_mode_nested_execution_v2.patch with some cosmetic tweaks. I back-patched it to 10 and 9.6, then had to fix some issues reported by Tom as followup commits. With respect to the bit quoted above, I rephrased the comment in a slightly different way that hopefully is a reasonable compromise, combined it with the main patch, and pushed it to master. Along the way I also got rid of the if statement you introduced and just made the Assert() more complicated instead, which seems better to me. When I tried back-porting the patch to v10 I discovered that the plpgsql changes conflict heavily and that ripping them out all the way produces regression failures under force_parallel_mode. I think I see why that's happening but it's not obvious how to me how to adapt b73f1b5c29e0ace5a281bc13ce09dea30e1b66de to the v10 code. Do you want to have a crack at it or should we just leave this as a master-only fix? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Oct 28, 2017 at 2:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Oct 16, 2017 at 12:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > When I tried back-porting the patch to v10 I discovered that the > plpgsql changes conflict heavily and that ripping them out all the way > produces regression failures under force_parallel_mode. I think I see > why that's happening but it's not obvious how to me how to adapt > b73f1b5c29e0ace5a281bc13ce09dea30e1b66de to the v10 code. Do you want > to have a crack at it or should we just leave this as a master-only > fix? > I think we need to make changes in exec_simple_recheck_plan to make the behavior similar to HEAD. With the attached patch, all tests passed with force_parallel_mode. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Sat, Oct 28, 2017 at 8:02 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I think we need to make changes in exec_simple_recheck_plan to make > the behavior similar to HEAD. With the attached patch, all tests > passed with force_parallel_mode. Committed to REL_10_STABLE. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Oct 29, 2017 at 9:55 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Oct 28, 2017 at 8:02 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I think we need to make changes in exec_simple_recheck_plan to make >> the behavior similar to HEAD. With the attached patch, all tests >> passed with force_parallel_mode. > > Committed to REL_10_STABLE. > Thanks, I have closed this entry in CF app. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes: > On Sat, Oct 28, 2017 at 8:02 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I think we need to make changes in exec_simple_recheck_plan to make >> the behavior similar to HEAD. With the attached patch, all tests >> passed with force_parallel_mode. > Committed to REL_10_STABLE. Sorry for resurrecting such an old thread, but I just noticed what commit 682ce911f did in exec_save_simple_expr: @@ -6588,8 +6588,8 @@ exec_save_simple_expr(PLpgSQL_expr *expr, CachedPlan *cplan) * force_parallel_mode is on, the planner might've stuck a Gather node * atop that. The simplest way to deal with this is to look through the * Gather node. The Gather node's tlist would normally contain a Var - * referencing the child node's output ... but setrefs.c might also have - * copied a Const as-is. + * referencing the child node's output, but it could also be a Param, or + * it could be a Const that setrefs.c copied as-is. */ plan = stmt->planTree; for (;;) @@ -6616,9 +6616,9 @@ exec_save_simple_expr(PLpgSQL_expr *expr, CachedPlan *cplan) /* If setrefs.c copied up a Const, no need to look further */ if (IsA(tle_expr, Const)) break; - /* Otherwise, it better be an outer Var */ - Assert(IsA(tle_expr, Var)); - Assert(((Var *) tle_expr)->varno == OUTER_VAR); + /* Otherwise, it had better be a Param or an outer Var */ + Assert(IsA(tle_expr, Param) || (IsA(tle_expr, Var) && + ((Var *) tle_expr)->varno == OUTER_VAR)); /* Descend to the child node */ plan = plan->lefttree; } I think this is completely wrong and should be reverted. There cannot be a Param there, and if there were it would not represent a reference to the Gather's child. The argument for this change seems to be what Amit said upthread: >>> I am not sure if we can write the comment like that (.. copied a Const >>> or Param as-is.) because fix_upper_expr_mutator in setrefs.c has a >>> special handling for Var and Param where constants are copied as-is >>> via expression_tree_mutator. but AFAICS that is based on a misreading of what fix_upper_expr_mutator does: /* Special cases (apply only AFTER failing to match to lower tlist) */ if (IsA(node, Param)) return fix_param_node(context->root, (Param *) node); Note the comment. A Param that matches something in the original Result node would have been replaced by a Var reference to the lower Result. If we find a Param still surviving in the Gather's tlist, then it is not such a reference, and descending as though it were is wrong and dangerous. AFAICS, the only case where we'd actually find such a Param in a Gather is if it's a reference to the value of an initplan --- but exec_save_simple_expr has already asserted that there's no initPlans attached to the Gather. I tried reverting this code change, and check-world still passes, with or without debug_parallel_query = regress. So if there is a case I'm missing, the regression tests don't expose it. regards, tom lane
On Fri, Mar 21, 2025 at 11:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think this is completely wrong and should be reverted. There > cannot be a Param there, and if there were it would not represent > a reference to the Gather's child. > > I tried reverting this code change, and check-world still passes, > with or without debug_parallel_query = regress. So if there is > a case I'm missing, the regression tests don't expose it. Did you try the test case from the original post on this thread? I'm perfectly willing to believe that we messed up here -- this was 8 years ago and I don't remember the details -- but it surprises me a little bit that I would have committed that change without verifying that it was necessary to resolve the reported problem. However, I have made such mistakes before and will probably make them again, so it certainly could have happened. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Mar 21, 2025 at 11:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I tried reverting this code change, and check-world still passes, >> with or without debug_parallel_query = regress. So if there is >> a case I'm missing, the regression tests don't expose it. > Did you try the test case from the original post on this thread? Yeah. It's fine, which is unsurprising even if I'm wrong, because that test case has no involvement with plpgsql. > I'm perfectly willing to believe that we messed up here -- this was 8 > years ago and I don't remember the details -- but it surprises me a > little bit that I would have committed that change without verifying > that it was necessary to resolve the reported problem. Amit says in the original post that the regression tests failed under force_parallel_mode = regress without this hack; but that's demonstrably not true now. I spent a bit of quality time with "git bisect" and found that the failures stopped at 0f7ec8d9c3aeb8964d3539561e5c8d4caef42bf6 is the first new commit commit 0f7ec8d9c3aeb8964d3539561e5c8d4caef42bf6 Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Wed Dec 12 13:49:41 2018 -0500 Repair bogus handling of multi-assignment Params in upper plan levels. which appears unrelated if you just read the commit message, but the actual diff tells the tale: @@ -2325,8 +2325,6 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context) /* If not supplied by input plans, evaluate the contained expr */ return fix_join_expr_mutator((Node *) phv->phexpr, context); } - if (IsA(node, Param)) - return fix_param_node(context->root, (Param *) node); /* Try matching more complex expressions too, if tlists have any */ if (context->outer_itlist && context->outer_itlist->has_non_vars) { @@ -2344,6 +2342,9 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context) if (newvar) return (Node *) newvar; } + /* Special cases (apply only AFTER failing to match to lower tlist) */ + if (IsA(node, Param)) + return fix_param_node(context->root, (Param *) node); fix_expr_common(context->root, node); return expression_tree_mutator(node, fix_join_expr_mutator, (and similarly in fix_upper_expr_mutator). So actually, I had broken setrefs.c's matching of Params to lower plan levels with the multi-assignment business, and Amit was dodging that breakage. But this change is still wrong in itself: if anything, it should have returned the Param, not treated it as a reference to the child plan. regards, tom lane
On Fri, Mar 21, 2025 at 1:41 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > (and similarly in fix_upper_expr_mutator). So actually, I had broken > setrefs.c's matching of Params to lower plan levels with the > multi-assignment business, and Amit was dodging that breakage. > But this change is still wrong in itself: if anything, it should > have returned the Param, not treated it as a reference to the > child plan. Ah ha! Well, that makes me feel a bit better -- perhaps what I committed wasn't the right way to fix it, but at least I wasn't just committing stuff totally at random. I'm happy to have you tidy up here in whatever way seems best to you. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > I'm happy to have you tidy up here in whatever way seems best to you. Cool, done. regards, tom lane
On Sat, Mar 22, 2025 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > I'm happy to have you tidy up here in whatever way seems best to you. > > Cool, done. > Thanks for taking care of this, and sorry for not digging deeper to find the appropriate fix. -- With Regards, Amit Kapila.