
From Amit Langote
Subject Re: Parallel INSERT (INTO ... SELECT ...)
Date
Msg-id CA+HiwqGfJR8K2cg-M2k0DO59gMsQgYHBdgbtJ_U1TTpzmJOxTw@mail.gmail.com
In response to Re: Parallel INSERT (INTO ... SELECT ...)  (Amit Langote <amitlangote09@gmail.com>)
List pgsql-hackers
On Wed, Feb 10, 2021 at 5:50 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Feb 10, 2021 at 5:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Feb 10, 2021 at 1:00 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > On Wed, Feb 10, 2021 at 1:35 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > > > It's parallel UNSAFE because it's not safe or even possible to have a
> > > > parallel plan for this.
> > > > (as UNSAFE is the max hazard level, no point in referencing
> > > > context->max_interesting).
> > > > And there are existing cases of bailing out and not doing further
> > > > safety checks (even your v15_delta.diff patch placed code - for
> > > > bailing out on "ON CONFLICT ... DO UPDATE" - underneath one such
> > > > existing case in max_parallel_hazard_walker()):
> > > >
> > > >     else if (IsA(node, Query))
> > > >     {
> > > >         Query      *query = (Query *) node;
> > > >
> > > >         /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
> > > >         if (query->rowMarks != NULL)
> > > >         {
> > > >             context->max_hazard = PROPARALLEL_UNSAFE;
> > > >             return true;
> > > >         }
> > >
> > > In my understanding, the max_parallel_hazard() query tree walk is to
> > > find constructs that are parallel unsafe in that they call code that
> > > can't run in parallel mode.  For example, FOR UPDATE/SHARE on
> > > traditional heap AM tuples calls AssignTransactionId() which doesn't
> > > support being called in parallel mode.  Likewise ON CONFLICT ... DO
> > > UPDATE calls heap_update() which doesn't support parallelism.  I'm not
> > > aware of any such hazards downstream of ExecValuesScan().
> > >
> > > > > You're trying to
> > > > > add something that bails based on second-guessing that a parallel path
> > > > > would not be chosen, which I find somewhat objectionable.
> > > > >
> > > > > If the main goal of bailing out is to avoid doing the potentially
> > > > > expensive modification safety check on the target relation, maybe we
> > > > > should try to somehow make the check less expensive.  I remember
> > > > > reading somewhere in the thread about caching the result of this check
> > > > > in relcache, but haven't closely studied the feasibility of doing so.
> > > > >
> > > >
> > > > There's no "second-guessing" involved here.
> > > > There is no underlying way of dividing up the VALUES data of
> > > > "INSERT...VALUES" amongst the parallel workers, even if the planner
> > > > was updated to produce a parallel-plan for the "INSERT...VALUES" case
> > > > (apart from the fact that spawning off parallel workers to insert that
> > > > data would almost always result in worse performance than a
> > > > non-parallel plan...)
> > > > The division of work for parallel workers is part of the table AM
> > > > (scan) implementation, which is not invoked for "INSERT...VALUES".
> > >
> > > I don't disagree that the planner would not normally assign a parallel
> > > path simply to pull values out of a VALUES list mentioned in the
> > > INSERT command, but deciding something based on the certainty of it in
> > > an earlier planning phase seems odd to me.  Maybe that's just me
> > > though.
> > >
> >
> > I think it is more a case where there is neither a need for
> > parallelism nor a desire to support it. The other possibility for
> > such a check could be at some earlier phase, say in
> > standard_planner() [1], where we already check for other constructs
> > for which we don't want parallelism (I think the check for
> > 'parse->hasModifyingCTE' is quite similar). In that check, too, we
> > simply assume the other operations to be parallel-unsafe. I think we
> > should rule out such cases earlier rather than later. Do you have
> > better ideas than what Greg has done to avoid parallelism for such
> > cases?
> >
> > [1] -
> > standard_planner()
> > {
> > ..
> > if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
> > IsUnderPostmaster &&
> > parse->commandType == CMD_SELECT &&
> > !parse->hasModifyingCTE &&
> > max_parallel_workers_per_gather > 0 &&
> > !IsParallelWorker())
> > {
> > /* all the cheap tests pass, so scan the query tree */
> > glob->maxParallelHazard = max_parallel_hazard(parse);
> > glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
> > }
> > else
> > {
> > /* skip the query tree scan, just assume it's unsafe */
> > glob->maxParallelHazard = PROPARALLEL_UNSAFE;
> > glob->parallelModeOK = false;
> > }
>
> Yeah, maybe having the block I was commenting on, viz.:
>
> +   /*
> +    * If there is no underlying SELECT, a parallel table-modification
> +    * operation is not possible (nor desirable).
> +    */
> +   hasSubQuery = false;
> +   foreach(lc, parse->rtable)
> +   {
> +       rte = lfirst_node(RangeTblEntry, lc);
> +       if (rte->rtekind == RTE_SUBQUERY)
> +       {
> +           hasSubQuery = true;
> +           break;
> +       }
> +   }
> +   if (!hasSubQuery)
> +       return PROPARALLEL_UNSAFE;
>
> before the standard_planner() block you quoted might be a good idea.
> So something like this:
>
> +   /*
> +    * If there is no underlying SELECT, a parallel table-modification
> +    * operation is not possible (nor desirable).
> +    */
> +   rangeTableHasSubQuery = false;
> +   foreach(lc, parse->rtable)
> +   {
> +       rte = lfirst_node(RangeTblEntry, lc);
> +       if (rte->rtekind == RTE_SUBQUERY)
> +       {
> +           rangeTableHasSubQuery = true;
> +           break;
> +       }
> +   }
>
>     if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
>         IsUnderPostmaster &&
>         (parse->commandType == CMD_SELECT ||
>          (IsModifySupportedInParallelMode(parse->commandType) &&
>           rangeTableHasSubQuery)) &&
>         !parse->hasModifyingCTE &&
>         max_parallel_workers_per_gather > 0 &&
>         !IsParallelWorker())
>     {
>         /* all the cheap tests pass, so scan the query tree */
>         glob->maxParallelHazard = max_parallel_hazard(parse);
>         glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
>     }
>     else
>     {
>         /* skip the query tree scan, just assume it's unsafe */
>         glob->maxParallelHazard = PROPARALLEL_UNSAFE;
>         glob->parallelModeOK = false;
>     }
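
The command-type/subquery part of that condition can be exercised in isolation. Below is a standalone sketch with simplified stand-in types; the enums and the shape of IsModifySupportedInParallelMode() are modeled loosely here for illustration, not taken from the actual sources:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for PostgreSQL's CmdType and range-table entry
 * kinds (hypothetical definitions for illustration only). */
typedef enum { CMD_SELECT, CMD_INSERT, CMD_UPDATE, CMD_DELETE } CmdType;
typedef enum { RTE_RELATION, RTE_SUBQUERY, RTE_VALUES } RTEKind;

/* In the patch under discussion, INSERT is the supported modify command. */
static bool
IsModifySupportedInParallelMode(CmdType cmd)
{
    return cmd == CMD_INSERT;
}

/* Does any range-table entry come from a subquery, i.e. is there an
 * underlying SELECT? */
static bool
range_table_has_subquery(const RTEKind *rtable, int nentries)
{
    for (int i = 0; i < nentries; i++)
        if (rtable[i] == RTE_SUBQUERY)
            return true;
    return false;
}

/*
 * The "cheap test" from the proposed condition: plain SELECT always
 * qualifies for the full query-tree scan; a supported modify command
 * qualifies only when its range table contains a subquery RTE.
 */
static bool
command_may_be_parallel(CmdType cmd, const RTEKind *rtable, int nentries)
{
    return cmd == CMD_SELECT ||
           (IsModifySupportedInParallelMode(cmd) &&
            range_table_has_subquery(rtable, nentries));
}
```

Under this model, INSERT ... SELECT (a range table containing an RTE_SUBQUERY entry) passes the cheap test, while INSERT ... VALUES (an RTE_VALUES entry only) is assumed unsafe without scanning the query tree.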

On second thought, maybe we could even put the hasSubQuery-based
short-circuit in the following block of max_parallel_hazard_walker():

    /*
     * When we're first invoked on a completely unplanned tree, we must
     * recurse into subqueries so as to locate parallel-unsafe constructs
     * anywhere in the tree.
     */
    else if (IsA(node, Query))
    {
        Query      *query = (Query *) node;

        /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
        if (query->rowMarks != NULL)
        {
            context->max_hazard = PROPARALLEL_UNSAFE;
            return true;
        }

Also, update the comment to mention we bail out if (!hasSubQuery) as a
special case.
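
Roughly, the modified branch might look like the following. This is an untested sketch against toy stand-in types, not the real definitions from parsenodes.h; the isModify/hasSubQuery fields stand in for whatever tests on the Query node the actual patch would use:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the planner types involved (not PostgreSQL's real
 * definitions). */
#define PROPARALLEL_SAFE   's'
#define PROPARALLEL_UNSAFE 'u'

typedef struct Query
{
    void *rowMarks;     /* non-NULL means SELECT FOR UPDATE/SHARE */
    bool  isModify;     /* INSERT/UPDATE/DELETE rather than plain SELECT */
    bool  hasSubQuery;  /* any RTE_SUBQUERY in the range table */
} Query;

typedef struct max_parallel_hazard_context
{
    char max_hazard;
} max_parallel_hazard_context;

/*
 * Sketch of the Query branch of max_parallel_hazard_walker() with the
 * proposed short-circuit added: a table-modification query with no
 * underlying SELECT is marked parallel unsafe immediately, next to the
 * existing rowMarks check.  Returning true aborts the tree walk.
 */
static bool
query_branch(Query *query, max_parallel_hazard_context *context)
{
    /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
    if (query->rowMarks != NULL)
    {
        context->max_hazard = PROPARALLEL_UNSAFE;
        return true;
    }

    /*
     * If there is no underlying SELECT, a parallel table-modification
     * operation is not possible (nor desirable), so bail out early.
     */
    if (query->isModify && !query->hasSubQuery)
    {
        context->max_hazard = PROPARALLEL_UNSAFE;
        return true;
    }

    return false;   /* nothing unsafe here; continue walking */
}
```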

-- 
Amit Langote
EDB: http://www.enterprisedb.com


