Re: Parallel INSERT (INTO ... SELECT ...) - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Parallel INSERT (INTO ... SELECT ...)
Date
Msg-id CAA4eK1+4hX1V0yRpTke-AEz0+gn14UZ-BaEbohnQoTnW_Noarg@mail.gmail.com
Whole thread Raw
In response to Re: Parallel INSERT (INTO ... SELECT ...)  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
On Mon, Sep 28, 2020 at 4:06 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Mon, Sep 28, 2020 at 8:45 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> >
> > On Sat, Sep 26, 2020 at 3:30 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > > I further checked on full txn id and command id. Yes, these are
> > > getting passed to workers  via InitializeParallelDSM() ->
> > > SerializeTransactionState(). I tried to summarize what we need to do
> > > in case of parallel inserts in general i.e. parallel COPY, parallel
> > > inserts in INSERT INTO and parallel inserts in CTAS.
> > >
> > > In the leader:
> > >     GetCurrentFullTransactionId()
> > >     GetCurrentCommandId(true)
> > >     EnterParallelMode();
> > >     InitializeParallelDSM() --> calls SerializeTransactionState()
> > > (both full txn id and command id are serialized into parallel DSM)
> > >
> > > In the workers:
> > > ParallelWorkerMain() -->  calls StartParallelWorkerTransaction() (both
> > > full txn id and command id are restored into workers'
> > > CurrentTransactionState->fullTransactionId and currentCommandId)
> > > If the parallel workers are meant for insertions, then we need to set
> > > currentCommandIdUsed = true; Maybe we can lift the assert in
> > > GetCurrentCommandId(), if we don't want to touch that function, then
> > > we can have a new function GetCurrentCommandidInWorker() whose
> > > functionality will be same as GetCurrentCommandId() without the
> > > Assert(!IsParallelWorker());.
> > >
> > > Am I missing something?
> > >
> > > If the above points are true, we might have to update the parallel
> > > copy patch set, test the use cases and post separately in the parallel
> > > copy thread in coming days.
> > >
> >
> > Hi Bharath,
> >
> > I pretty much agree with your above points.
> >
> > I've attached an updated Parallel INSERT...SELECT patch, that:
> > - Only uses existing transaction state serialization support for
> > transfer of xid and cid.
> > - Adds a "SetCurrentCommandIdUsedForWorker" function, for setting
> > currentCommandIdUsed=true at the start of a parallel operation (used
> > for Parallel INSERT case, where we know the currentCommandId has been
> > assigned to the worker at the start of the parallel operation).
> > - Tweaks the Assert condition within "used=true" parameter case in
> > GetCurrentCommandId(), so that it only fires if in a parallel worker
> > and currentCommandId is false - refer to the updated comment in that
> > function.
> > - Does not modify any existing GetCurrentCommandId() calls.
> > - Does not remove any existing parallel-related asserts/checks, except
> > for the "cannot insert tuples in a parallel worker" error in
> > heap_prepare_insert(). I am still considering what to do with the
> > original error-check here.
> > [- Does not yet cater for other exclusion cases that you and Vignesh
> > have pointed out]
> >
> > This patch is mostly a lot cleaner, but does contain a possible ugly
> > hack, in that where it needs to call GetCurrentFullTransactionId(), it
> > must temporarily escape parallel-mode (recalling that parallel-mode is
> > set true right at the top of ExectePlan() in the cases of Parallel
> > INSERT/SELECT).
> >
>
> Thanks Greg.
>
> In general, see a few things common to all parallel insert
> cases(CTAS[1], COPY[2], INSERT INTO SELECTs):
> 1. Removal of "cannot insert tuples in a parallel worker" restriction
> from heap_prepare_insert()
> 2. Each worker should be able to set currentCommandIdUsed to true.
> 3. The change you proposed to make in GetCurrentCommandId()'s assert condition.
>
> Please add if I miss any other common point.
>
> Common solutions to each of the above points would be beneficial to
> all the parallel insert cases. How about having a common thread,
> discussion and a common patch for all the 3 points?
>

I am not sure if that is required at this stage, lets first sort out
other parts of the design because there could be other bigger problems
which we have not thought bout yet. I have already shared some
thoughts on those points in this thread, lets first get that done and
have the basic patch ready then if required we can discuss in detail
about these points in other thread.

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Christoph Berg
Date:
Subject: Re: gs_group_1 crashing on 13beta2/s390x
Next
From: Bharath Rupireddy
Date:
Subject: Skip ExecCheckRTPerms in CTAS with no data