Re: Parallel INSERT (INTO ... SELECT ...) - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: Parallel INSERT (INTO ... SELECT ...)
Date
Msg-id CAFiTN-v5UCEY8Jagocu5Aqqb7_6WcoWnOzLyUJ8tc1=NBdF_zg@mail.gmail.com
Whole thread Raw
In response to Re: Parallel INSERT (INTO ... SELECT ...)  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: Parallel INSERT (INTO ... SELECT ...)  (Greg Nancarrow <gregn4422@gmail.com>)
List pgsql-hackers
On Tue, Sep 29, 2020 at 8:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Sep 28, 2020 at 8:45 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> >
> > On Sat, Sep 26, 2020 at 3:30 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > > I further checked on full txn id and command id. Yes, these are
> > > getting passed to workers  via InitializeParallelDSM() ->
> > > SerializeTransactionState(). I tried to summarize what we need to do
> > > in case of parallel inserts in general i.e. parallel COPY, parallel
> > > inserts in INSERT INTO and parallel inserts in CTAS.
> > >
> > > In the leader:
> > >     GetCurrentFullTransactionId()
> > >     GetCurrentCommandId(true)
> > >     EnterParallelMode();
> > >     InitializeParallelDSM() --> calls SerializeTransactionState()
> > > (both full txn id and command id are serialized into parallel DSM)
> > >
> > > In the workers:
> > > ParallelWorkerMain() -->  calls StartParallelWorkerTransaction() (both
> > > full txn id and command id are restored into workers'
> > > CurrentTransactionState->fullTransactionId and currentCommandId)
> > > If the parallel workers are meant for insertions, then we need to set
> > > currentCommandIdUsed = true; Maybe we can lift the assert in
> > > GetCurrentCommandId(), if we don't want to touch that function, then
> > > we can have a new function GetCurrentCommandidInWorker() whose
> > > functionality will be same as GetCurrentCommandId() without the
> > > Assert(!IsParallelWorker());.
> > >
> > > Am I missing something?
> > >
> > > If the above points are true, we might have to update the parallel
> > > copy patch set, test the use cases and post separately in the parallel
> > > copy thread in coming days.
> > >
> >
> > Hi Bharath,
> >
> > I pretty much agree with your above points.
> >
> > I've attached an updated Parallel INSERT...SELECT patch, that:
> > - Only uses existing transaction state serialization support for
> > transfer of xid and cid.
> > - Adds a "SetCurrentCommandIdUsedForWorker" function, for setting
> > currentCommandIdUsed=true at the start of a parallel operation (used
> > for Parallel INSERT case, where we know the currentCommandId has been
> > assigned to the worker at the start of the parallel operation).
> > - Tweaks the Assert condition within "used=true" parameter case in
> > GetCurrentCommandId(), so that it only fires if in a parallel worker
> > and currentCommandId is false - refer to the updated comment in that
> > function.
> > - Does not modify any existing GetCurrentCommandId() calls.
> > - Does not remove any existing parallel-related asserts/checks, except
> > for the "cannot insert tuples in a parallel worker" error in
> > heap_prepare_insert(). I am still considering what to do with the
> > original error-check here.
> > [- Does not yet cater for other exclusion cases that you and Vignesh
> > have pointed out]
> >
> > This patch is mostly a lot cleaner, but does contain a possible ugly
> > hack, in that where it needs to call GetCurrentFullTransactionId(), it
> > must temporarily escape parallel-mode (recalling that parallel-mode is
> > set true right at the top of ExectePlan() in the cases of Parallel
> > INSERT/SELECT).
>
> I think you still need to work on the costing part, basically if we
> are parallelizing whole insert then plan is like below
>
> -> Gather
>   -> Parallel Insert
>       -> Parallel Seq Scan
>
> That means the tuple we are selecting via scan are not sent back to
> the gather node, so in cost_gather we need to see if it is for the
> INSERT then there is no row transferred through the parallel queue
> that mean we need not to pay any parallel tuple cost.

I just looked into the parallel CTAS[1] patch for the same thing, and
I can see in that patch it is being handled.

[1] https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: Parallel INSERT (INTO ... SELECT ...)
Next
From: Tom Lane
Date:
Subject: Re: calling procedures is slow and consumes extra much memory against calling function