Re: Parallel INSERT (INTO ... SELECT ...) - Mailing list pgsql-hackers

From Greg Nancarrow
Subject Re: Parallel INSERT (INTO ... SELECT ...)
Date
Msg-id CAJcOf-fhdbhP7PeKf-sgRuZG4DLPyrbQUbub=ho=A9dyNhhrnw@mail.gmail.com
Whole thread Raw
In response to Re: Parallel INSERT (INTO ... SELECT ...)  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Fri, Oct 30, 2020 at 5:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > So then to avoid
> > errors from when parallel-mode is forced on and such unsafe INSERTs
> > are run, the only real choice is to only allow parallelModeNeeded to
> > be true for SELECT only (not INSERT), and this is kind of cheating and
> > also not picking up cases where parallel-safe INSERT is run but
> > invokes parallel-mode-unsafe features.
> > My conclusion, at least for the moment, is to leave the check where it is.
> >
>
> Okay, then can we integrate the functionality of
> MaxParallelHazardForModify in max_parallel_hazard? Calling it
> separately looks bit awkward.
>

Looking into that.

> >
> > > Few other comments on latest patch:
> > > ===============================
> > > 1.
> > > MaxRelParallelHazardForModify()
> > > {
> > > ..
> > > + if (commandType == CMD_INSERT || commandType == CMD_UPDATE)
> > > + {
> > > + /*
> > > ..
> > >
> > > Why to check CMD_UPDATE here?
> > >
> >
> > That was a bit of forward-thinking, for when/if UPDATE/DELETE is
> > supported in parallel-mode.
> > Column default expressions and check-constraints are only applicable
> > to INSERT and UPDATE.
> > Note however that currently this function can only ever be called with
> > commandType == CMD_INSERT.
> >
>
> I feel then for other command types there should be an Assert rather
> than try to handle something which is not yet implemented nor it is
> clear what all is required for that. It confuses the reader, at least
> it confused me. Probably we can write a comment but I don't think we
> should have any check for Update at this stage of work.
>

OK, for now I'll restrict the checks to INSERT, but I'll add comments
to assist with potential future UPDATE support.

> > > 2.
> > > +void PrepareParallelModeForModify(CmdType commandType, bool
> > > isParallelModifyLeader)
> > > +{
> > > + Assert(!IsInParallelMode());
> > > +
> > > + if (isParallelModifyLeader)
> > > + (void)GetCurrentCommandId(true);
> > > +
> > > + (void)GetCurrentFullTransactionId();
> > >
> > > Here, we should use GetCurrentTransactionId() similar to heap_insert
> > > or other heap operations. I am not sure why you have used
> > > GetCurrentFullTransactionId?
> > >
> >
> > GetCurrentTransactionId() and GetCurrentFullTransactionId() actually
> > have the same functionality, just a different return value (which is
> > not being used here).
> >
>
> Sure but lets use what is required.
>
> > But anyway I've changed it to use GetCurrentTransactionId().
> >
>
> But comments in ExecutePlan and PrepareParallelModeForModify still
> refer to FullTransactionId.
>

I believe those comments are technically correct.
GetCurrentTransactionId() calls AssignTransactionId() to do all the
work - and the comment for that function says "Assigns a new permanent
FullTransactionId to the given TransactionState".

> >
> >
> > > 4. Have you checked the overhead of this on the planner for different
> > > kinds of statements like inserts into tables having 100 or 500
> > > partitions? Similarly, it is good to check the overhead of domain
> > > related checks added in the patch.
> > >
> >
> > Checking that now and will post results soon.
> >
>

I am seeing a fair bit of overhead in the planning for the INSERT
parallel-safety checks (mind you, compared to the overall performance
gain, it's not too bad).
Some representative timings for a parallel INSERT of a millions rows
into 100,250 and 500 partitions are shown below.

(1) Without patch

# Partitions                    Planning Time (ms)
Execution Time (ms)
100                                1.014
           4176.435
250                                0.404
           3842.414
500                                0.529
           4440.633


(2) With Parallel INSERT patch

# Partitions                    Planning Time (ms)
Execution Time (ms)
100                                11.420
          2131.148
250                                23.269
          3472.259
500                                36.531
          3238.868

I'm looking into how this can be improved by better integration into
the current code, and addressing locking concerns that you've
previously mentioned.

Regards,
Greg Nancarrow
Fujitsu Australia



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: bulk typos
Next
From: Luc Vlaming
Date:
Subject: Re: should INSERT SELECT use a BulkInsertState?