Re: Parallel INSERT (INTO ... SELECT ...) - Mailing list pgsql-hackers
From: Amit Langote
Subject: Re: Parallel INSERT (INTO ... SELECT ...)
Date:
Msg-id: CA+HiwqH3eCaCgTgvhtBH03f8jAjnvKpmHp25Y4oQejGNYHuhLg@mail.gmail.com
In response to: Re: Parallel INSERT (INTO ... SELECT ...) (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Parallel INSERT (INTO ... SELECT ...)
List: pgsql-hackers
On Wed, Feb 10, 2021 at 5:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 10, 2021 at 1:00 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Wed, Feb 10, 2021 at 1:35 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > > It's parallel UNSAFE because it's not safe or even possible to have a
> > > parallel plan for this.
> > > (as UNSAFE is the max hazard level, no point in referencing
> > > context->max_interesting).
> > > And there are existing cases of bailing out and not doing further
> > > safety checks (even your v15_delta.diff patch placed code - for
> > > bailing out on "ON CONFLICT ... DO UPDATE" - underneath one such
> > > existing case in max_parallel_hazard_walker()):
> > >
> > >     else if (IsA(node, Query))
> > >     {
> > >         Query *query = (Query *) node;
> > >
> > >         /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
> > >         if (query->rowMarks != NULL)
> > >         {
> > >             context->max_hazard = PROPARALLEL_UNSAFE;
> > >             return true;
> > >         }
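(For reference, a bail-out for "ON CONFLICT ... DO UPDATE" placed underneath
that existing case might look roughly as follows.  This is only a sketch
inferred from the description above, not the actual v15_delta.diff hunk; it
assumes the stock Query fields rowMarks/onConflict and the walker's context
struct as they exist in the tree:)

        else if (IsA(node, Query))
        {
            Query *query = (Query *) node;

            /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
            if (query->rowMarks != NULL)
            {
                context->max_hazard = PROPARALLEL_UNSAFE;
                return true;
            }

            /*
             * Sketch: INSERT ... ON CONFLICT DO UPDATE would likewise be
             * treated as unsafe here, because the update path ultimately
             * calls heap_update(), which cannot run in parallel mode.
             */
            if (query->onConflict != NULL &&
                query->onConflict->action == ONCONFLICT_UPDATE)
            {
                context->max_hazard = PROPARALLEL_UNSAFE;
                return true;
            }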
> >
> > In my understanding, the max_parallel_hazard() query tree walk is to
> > find constructs that are parallel unsafe in that they call code that
> > can't run in parallel mode.  For example, FOR UPDATE/SHARE on
> > traditional heap AM tuples calls AssignTransactionId() which doesn't
> > support being called in parallel mode.  Likewise ON CONFLICT ... DO
> > UPDATE calls heap_update() which doesn't support parallelism.  I'm not
> > aware of any such hazards downstream of ExecValuesScan().
> >
> > > > You're trying to
> > > > add something that bails based on second-guessing that a parallel
> > > > path would not be chosen, which I find somewhat objectionable.
> > > >
> > > > If the main goal of bailing out is to avoid doing the potentially
> > > > expensive modification safety check on the target relation, maybe we
> > > > should try to somehow make the check less expensive.  I remember
> > > > reading somewhere in the thread about caching the result of this
> > > > check in relcache, but haven't closely studied the feasibility of
> > > > doing so.
> > > >
> > >
> > > There's no "second-guessing" involved here.
> > > There is no underlying way of dividing up the VALUES data of
> > > "INSERT...VALUES" amongst the parallel workers, even if the planner
> > > was updated to produce a parallel-plan for the "INSERT...VALUES" case
> > > (apart from the fact that spawning off parallel workers to insert that
> > > data would almost always result in worse performance than a
> > > non-parallel plan...)
> > > The division of work for parallel workers is part of the table AM
> > > (scan) implementation, which is not invoked for "INSERT...VALUES".
> >
> > I don't disagree that the planner would not normally assign a parallel
> > path simply to pull values out of a VALUES list mentioned in the
> > INSERT command, but deciding something based on the certainty of it in
> > an earlier planning phase seems odd to me.  Maybe that's just me
> > though.
>
> I think it is more of a case where neither is there a need for parallelism
> nor do we want to support parallelism for it. The other possibility for
> such a check could be at some earlier phase, say in standard_planner()
> [1], where we are doing checks for other constructs where we don't want
> parallelism (I think the check for 'parse->hasModifyingCTE' is quite
> similar). If you look at that check, we likewise just assume such other
> operations to be in the category of parallel-unsafe. I think we should
> rule out such cases earlier rather than later. Do you have better ideas
> than what Greg has done to avoid parallelism for such cases?
>
> [1] -
> standard_planner()
> {
> ..
>     if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
>         IsUnderPostmaster &&
>         parse->commandType == CMD_SELECT &&
>         !parse->hasModifyingCTE &&
>         max_parallel_workers_per_gather > 0 &&
>         !IsParallelWorker())
>     {
>         /* all the cheap tests pass, so scan the query tree */
>         glob->maxParallelHazard = max_parallel_hazard(parse);
>         glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
>     }
>     else
>     {
>         /* skip the query tree scan, just assume it's unsafe */
>         glob->maxParallelHazard = PROPARALLEL_UNSAFE;
>         glob->parallelModeOK = false;
>     }

Yeah, maybe having the block I was commenting on, viz.:

+   /*
+    * If there is no underlying SELECT, a parallel table-modification
+    * operation is not possible (nor desirable).
+    */
+   hasSubQuery = false;
+   foreach(lc, parse->rtable)
+   {
+       rte = lfirst_node(RangeTblEntry, lc);
+       if (rte->rtekind == RTE_SUBQUERY)
+       {
+           hasSubQuery = true;
+           break;
+       }
+   }
+   if (!hasSubQuery)
+       return PROPARALLEL_UNSAFE;

before the standard_planner() block you quoted might be a good idea.  So
something like this:

+   /*
+    * If there is no underlying SELECT, a parallel table-modification
+    * operation is not possible (nor desirable).
+    */
+   rangeTableHasSubQuery = false;
+   foreach(lc, parse->rtable)
+   {
+       rte = lfirst_node(RangeTblEntry, lc);
+       if (rte->rtekind == RTE_SUBQUERY)
+       {
+           rangeTableHasSubQuery = true;
+           break;
+       }
+   }

    if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
        IsUnderPostmaster &&
        (parse->commandType == CMD_SELECT ||
         (IsModifySupportedInParallelMode(parse->commandType) &&
          rangeTableHasSubQuery)) &&
        !parse->hasModifyingCTE &&
        max_parallel_workers_per_gather > 0 &&
        !IsParallelWorker())
    {
        /* all the cheap tests pass, so scan the query tree */
        glob->maxParallelHazard = max_parallel_hazard(parse);
        glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
    }
    else
    {
        /* skip the query tree scan, just assume it's unsafe */
        glob->maxParallelHazard = PROPARALLEL_UNSAFE;
        glob->parallelModeOK = false;
    }

--
Amit Langote
EDB: http://www.enterprisedb.com
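A self-contained sketch of the same idea, with the range-table scan factored
into a helper, is shown below.  The helper name is hypothetical (it is not
from any posted patch); IsModifySupportedInParallelMode() is the macro from
the patch series under discussion, and the remaining types and macros
(Query, RangeTblEntry, foreach, lfirst_node, RTE_SUBQUERY) are the stock
ones from nodes/parsenodes.h and nodes/pg_list.h.

    #include "postgres.h"

    #include "nodes/parsenodes.h"
    #include "nodes/pg_list.h"

    /*
     * range_table_has_subquery
     *      Return true if the query's range table contains a subquery RTE,
     *      i.e. an underlying SELECT whose scan could be divided up among
     *      parallel workers.  A plain INSERT ... VALUES has no such RTE,
     *      so a parallel table-modification plan is neither possible nor
     *      desirable for it.
     *
     * (Hypothetical helper; the loop body is the one proposed above.)
     */
    static bool
    range_table_has_subquery(Query *parse)
    {
        ListCell   *lc;

        foreach(lc, parse->rtable)
        {
            RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);

            if (rte->rtekind == RTE_SUBQUERY)
                return true;
        }

        return false;
    }

The cheap-tests block in standard_planner() could then read, for example:

    if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
        IsUnderPostmaster &&
        (parse->commandType == CMD_SELECT ||
         (IsModifySupportedInParallelMode(parse->commandType) &&
          range_table_has_subquery(parse))) &&
        !parse->hasModifyingCTE &&
        max_parallel_workers_per_gather > 0 &&
        !IsParallelWorker())

Factoring the loop out avoids introducing an ad-hoc flag variable ahead of
the test, and thanks to && short-circuiting the range table is only scanned
for command types where a parallel modification is even being considered.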