Re: Parallel INSERT (INTO ... SELECT ...) - Mailing list pgsql-hackers

From: Amit Langote
Subject: Re: Parallel INSERT (INTO ... SELECT ...)
Msg-id: CA+HiwqGfJR8K2cg-M2k0DO59gMsQgYHBdgbtJ_U1TTpzmJOxTw@mail.gmail.com
In response to: Re: Parallel INSERT (INTO ... SELECT ...) (Amit Langote <amitlangote09@gmail.com>)
List: pgsql-hackers
On Wed, Feb 10, 2021 at 5:50 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Feb 10, 2021 at 5:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Feb 10, 2021 at 1:00 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > On Wed, Feb 10, 2021 at 1:35 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > > > It's parallel UNSAFE because it's not safe or even possible to have a
> > > > parallel plan for this.
> > > > (as UNSAFE is the max hazard level, no point in referencing
> > > > context->max_interesting).
> > > > And there are existing cases of bailing out and not doing further
> > > > safety checks (even your v15_delta.diff patch placed code - for
> > > > bailing out on "ON CONFLICT ... DO UPDATE" - underneath one such
> > > > existing case in max_parallel_hazard_walker()):
> > > >
> > > >     else if (IsA(node, Query))
> > > >     {
> > > >         Query    *query = (Query *) node;
> > > >
> > > >         /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
> > > >         if (query->rowMarks != NULL)
> > > >         {
> > > >             context->max_hazard = PROPARALLEL_UNSAFE;
> > > >             return true;
> > > >         }
> > >
> > > In my understanding, the max_parallel_hazard() query tree walk is to
> > > find constructs that are parallel unsafe in that they call code that
> > > can't run in parallel mode.  For example, FOR UPDATE/SHARE on
> > > traditional heap AM tuples calls AssignTransactionId() which doesn't
> > > support being called in parallel mode.  Likewise ON CONFLICT ... DO
> > > UPDATE calls heap_update() which doesn't support parallelism.  I'm not
> > > aware of any such hazards downstream of ExecValuesScan().
> > >
> > > > > You're trying to
> > > > > add something that bails based on second-guessing that a parallel path
> > > > > would not be chosen, which I find somewhat objectionable.
> > > > >
> > > > > If the main goal of bailing out is to avoid doing the potentially
> > > > > expensive modification safety check on the target relation, maybe we
> > > > > should try to somehow make the check less expensive.  I remember
> > > > > reading somewhere in the thread about caching the result of this check
> > > > > in relcache, but haven't closely studied the feasibility of doing so.
> > > >
> > > > There's no "second-guessing" involved here.
> > > > There is no underlying way of dividing up the VALUES data of
> > > > "INSERT...VALUES" amongst the parallel workers, even if the planner
> > > > was updated to produce a parallel plan for the "INSERT...VALUES" case
> > > > (apart from the fact that spawning off parallel workers to insert that
> > > > data would almost always result in worse performance than a
> > > > non-parallel plan...)
> > > > The division of work for parallel workers is part of the table AM
> > > > (scan) implementation, which is not invoked for "INSERT...VALUES".
> > >
> > > I don't disagree that the planner would not normally assign a parallel
> > > path simply to pull values out of a VALUES list mentioned in the
> > > INSERT command, but deciding something based on the certainty of it in
> > > an earlier planning phase seems odd to me.  Maybe that's just me
> > > though.
> >
> > I think it is more a case where there is neither a need for parallelism
> > nor do we want to support parallelism for it.  The other possibility for
> > such a check could be at some earlier phase, say in standard_planner()
> > [1], where we do checks for other constructs for which we don't want
> > parallelism (I think the check for 'parse->hasModifyingCTE' is quite
> > similar).  If you look at that check, we likewise just assume other
> > operations to be in the category of parallel-unsafe.  I think we should
> > rule out such cases earlier rather than later.  Do you have better ideas
> > than what Greg has done to avoid parallelism for such cases?
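[Editor's note: Greg's parenthetical above, that UNSAFE is the maximum hazard level, so there is no point consulting context->max_interesting once it is reached, can be sketched outside the PostgreSQL tree. The types and helpers below (hazard_rank, record_hazard) are simplified stand-ins invented for illustration; they are not the actual code in the planner's clauses.c, though the three PROPARALLEL_* labels mirror the real pg_proc values.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified stand-ins for PostgreSQL's parallel-safety labels; the real
 * constants carry the same character values, but this context struct and
 * these helpers are an illustrative mock, not the planner's code.
 */
#define PROPARALLEL_SAFE        's'
#define PROPARALLEL_RESTRICTED  'r'
#define PROPARALLEL_UNSAFE      'u'

typedef struct max_parallel_hazard_context
{
    char    max_hazard;         /* worst hazard found so far */
    char    max_interesting;    /* worst hazard the caller cares about */
} max_parallel_hazard_context;

/* Map each label to a severity rank so a maximum can be taken. */
static int
hazard_rank(char proparallel)
{
    switch (proparallel)
    {
        case PROPARALLEL_SAFE:          return 0;
        case PROPARALLEL_RESTRICTED:    return 1;
        default:                        return 2;   /* PROPARALLEL_UNSAFE */
    }
}

/*
 * Record a newly found hazard.  Returns true once the worst level the
 * caller is interested in has been reached, so the tree walk can stop;
 * since UNSAFE is the top rank, finding it always ends the walk.
 */
static bool
record_hazard(max_parallel_hazard_context *context, char proparallel)
{
    if (hazard_rank(proparallel) > hazard_rank(context->max_hazard))
        context->max_hazard = proparallel;

    return hazard_rank(context->max_hazard) >=
           hazard_rank(context->max_interesting);
}
```

Under this reading, bailing out on a construct like FOR UPDATE/SHARE is just the walk hitting the top of the severity order early.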
> >
> > [1] -
> > standard_planner()
> > {
> > ..
> >     if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
> >         IsUnderPostmaster &&
> >         parse->commandType == CMD_SELECT &&
> >         !parse->hasModifyingCTE &&
> >         max_parallel_workers_per_gather > 0 &&
> >         !IsParallelWorker())
> >     {
> >         /* all the cheap tests pass, so scan the query tree */
> >         glob->maxParallelHazard = max_parallel_hazard(parse);
> >         glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
> >     }
> >     else
> >     {
> >         /* skip the query tree scan, just assume it's unsafe */
> >         glob->maxParallelHazard = PROPARALLEL_UNSAFE;
> >         glob->parallelModeOK = false;
> >     }
>
> Yeah, maybe having the block I was commenting on, viz.:
>
> +    /*
> +     * If there is no underlying SELECT, a parallel table-modification
> +     * operation is not possible (nor desirable).
> +     */
> +    hasSubQuery = false;
> +    foreach(lc, parse->rtable)
> +    {
> +        rte = lfirst_node(RangeTblEntry, lc);
> +        if (rte->rtekind == RTE_SUBQUERY)
> +        {
> +            hasSubQuery = true;
> +            break;
> +        }
> +    }
> +    if (!hasSubQuery)
> +        return PROPARALLEL_UNSAFE;
>
> before the standard_planner() block you quoted might be a good idea.
> So something like this:
>
> +    /*
> +     * If there is no underlying SELECT, a parallel table-modification
> +     * operation is not possible (nor desirable).
> +     */
> +    rangeTableHasSubQuery = false;
> +    foreach(lc, parse->rtable)
> +    {
> +        rte = lfirst_node(RangeTblEntry, lc);
> +        if (rte->rtekind == RTE_SUBQUERY)
> +        {
> +            rangeTableHasSubQuery = true;
> +            break;
> +        }
> +    }
>
>     if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
>         IsUnderPostmaster &&
>         (parse->commandType == CMD_SELECT ||
>          (IsModifySupportedInParallelMode(parse->commandType) &&
>           rangeTableHasSubQuery)) &&
>         !parse->hasModifyingCTE &&
>         max_parallel_workers_per_gather > 0 &&
>         !IsParallelWorker())
>     {
>         /* all the cheap tests pass, so scan the query tree */
>         glob->maxParallelHazard = max_parallel_hazard(parse);
>         glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
>     }
>     else
>     {
>         /* skip the query tree scan, just assume it's unsafe */
>         glob->maxParallelHazard = PROPARALLEL_UNSAFE;
>         glob->parallelModeOK = false;
>     }

On second thought, maybe we could even put the hasSubQuery-based
short-circuit in the following block of max_parallel_hazard_walker():

    /*
     * When we're first invoked on a completely unplanned tree, we must
     * recurse into subqueries so as to locate parallel-unsafe constructs
     * anywhere in the tree.
     */
    else if (IsA(node, Query))
    {
        Query    *query = (Query *) node;

        /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
        if (query->rowMarks != NULL)
        {
            context->max_hazard = PROPARALLEL_UNSAFE;
            return true;
        }

Also, update the comment to mention that we bail out on (!hasSubQuery) as a
special case.

--
Amit Langote
EDB: http://www.enterprisedb.com
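[Editor's note: the subquery-detection loop discussed in this message can be sketched in isolation. RTEKind, RangeTblEntry, and range_table_has_subquery below are simplified stand-ins invented for illustration; the real planner walks a List of RangeTblEntry nodes via foreach/lfirst_node rather than a plain array, so this shows only the logic of the proposed check, not PostgreSQL code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative mock of the range-table walk.  These enum values and the
 * array standing in for parse->rtable are simplified stand-ins for the
 * parser structures in parsenodes.h.
 */
typedef enum RTEKind
{
    RTE_RELATION,   /* ordinary table, e.g. the INSERT target */
    RTE_SUBQUERY,   /* subquery in FROM, e.g. the SELECT of INSERT ... SELECT */
    RTE_VALUES      /* VALUES list, e.g. INSERT ... VALUES */
} RTEKind;

typedef struct RangeTblEntry
{
    RTEKind     rtekind;
} RangeTblEntry;

/*
 * Return true if any range-table entry is a subquery.  INSERT ... VALUES
 * produces no RTE_SUBQUERY entry, so the proposed short-circuit would
 * classify it as parallel unsafe without walking the rest of the tree.
 */
static bool
range_table_has_subquery(const RangeTblEntry *rtable, size_t nentries)
{
    for (size_t i = 0; i < nentries; i++)
    {
        if (rtable[i].rtekind == RTE_SUBQUERY)
            return true;
    }
    return false;
}
```

The two command shapes debated in the thread then diverge exactly here: an INSERT ... SELECT range table contains the target relation plus a subquery RTE, while an INSERT ... VALUES range table contains the target relation plus a VALUES RTE only.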