Re: Parallel Inserts in CREATE TABLE AS - Mailing list pgsql-hackers

From Greg Nancarrow
Subject Re: Parallel Inserts in CREATE TABLE AS
Date
Msg-id CAJcOf-eCnsmN2mcg3XVeV+0mCmoSXe4=UbS=GewHdTjqy=vmCA@mail.gmail.com
Whole thread Raw
In response to Re: Parallel Inserts in CREATE TABLE AS  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
On Fri, Mar 19, 2021 at 4:33 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> In an offlist discussion with Robert and Dilip, using fallocate to
> extend the relation may help to extend the relation faster. In regards
> to this, it looks like the AIO/DIO patch set of Andres [1]  which
> involves using fallocate() to extend files will surely be helpful.
> Until then, we honestly feel that the parallel inserts in CTAS patch
> set be put on hold and revive it later.
>

Hi,

I had partially reviewed some of the patches (first scan) when I was
alerted to your post and intention to put the patch on hold.
I thought I'd just post the comments I have so far, and you can look
at them at a later time when/if you revive the patch.


Patch 0001

1) Patch comment

Leader inserts its share of tuples if instructed to do, and so are workers

should be:

Leader inserts its share of tuples if instructed to, and so do the workers.


2)

void
SetParallelInsertState(ParallelInsertCmdKind ins_cmd, QueryDesc *queryDesc)
{
    GatherState *gstate;
    DestReceiver *dest;

    Assert(queryDesc && (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS));

    gstate = (GatherState *) queryDesc->planstate;
    dest = queryDesc->dest;

    /*
    * Parallel insertions are not possible either if the upper node is not
    * Gather or it's a Gather but it have some projections to perform.
    */
    if (!IsA(gstate, GatherState) || gstate->ps.ps_ProjInfo)
        return;


I think it would look better for code to be:


    dest = queryDesc->dest;

    /*
    * Parallel insertions are not possible either if the upper node is not
    * Gather or it's a Gather but it have some projections to perform.
    */
    if (!IsA(queryDesc->planstate, GatherState) ||
queryDesc->planstate.ps_ProjInfo)
        return;
    gstate = (GatherState *) queryDesc->planstate;


3) src/backend/executor/execParallel.c

+ pg_atomic_uint64 processed;

I am wondering, when there is contention from multiple workers in
writing back their processed count, how well does this work? Any
performance issues?
For the Parallel INSERT patch (which has not yet been committed) it
currently uses an array of processed counts for the workers (since #
of workers is capped) so there is never any contention related to
this.


4) src/backend/executor/execParallel.c

You shouldn't use intermingled declarations and code.
https://www.postgresql.org/docs/13/source-conventions.html

Best to move the uninitialized variable declaration to the top of the block:


ParallelInsertCTASInfo *info = NULL;
char *intoclause_str = NULL;
int intoclause_len;
char *intoclause_space = NULL;

should be:

int intoclause_len;
ParallelInsertCTASInfo *info = NULL;
char *intoclause_str = NULL;
char *intoclause_space = NULL;


5) ExecParallelGetInsReceiver

Would look better to have:

DR_intorel *receiver;

receiver = (DR_intorel *)CreateIntoRelDestReceiver(intoclause);

receiver->is_parallel_worker = true;
receiver->object_id = fpes->objectid;


6) GetParallelInsertCmdType

I think the following would be better:

ParallelInsertCmdKind
GetParallelInsertCmdType(DestReceiver *dest)
{
    if (dest &&
        dest->mydest == DestIntoRel &&
            ((DR_intorel *) dest)->is_parallel)
        return PARALLEL_INSERT_CMD_CREATE_TABLE_AS;

    return PARALLEL_INSERT_CMD_UNDEF;
}


7) IsParallelInsertAllowed

In the following code:

/* Below check may hit in case this function is called from explain.c. */
if (!(into && IsA(into, IntoClause)))
    return false;

If "into" is non-NULL, isn't it guaranteed to point at an IntoClause?

I think the code can just be:

/* Below check may hit in case this function is called from explain.c. */
if (!into)
    return false;

8) ExecGather

The comments and variable name are likely to cause confusion when the
parallel INSERT statement is implemented. Suggest minor change:

change:

   bool perform_parallel_ins = false;

to:

   bool perform_parallel_ins_no_readers = false;


change:

/*
* Do not create tuple queue readers for commands with parallel
* insertion. Because the gather node will not receive any
* tuples, the workers will insert the tuples into the target
* relation.
*/

to:

/*
* Do not create tuple queue readers for commands with parallel
* insertion that don't additionally return tuples. In this case,
* the workers will only insert the tuples into the target
* relation and the gather node will not receive any tuples.
*/


I think some changes in other areas are needed for the same reasons.



Patch 0002

1) I noticed that "rows" is not zero (and so is not displayed as 0 in
the EXPLAIN output for Gather) for the Gather node when parallel
inserts will be used. This doesn't seem to be right. I think that if
PARALLEL_INSERT_CAN_IGN_TUP_COST is set, path->rows should be set to
0, and just let existing "run_cost" be evaluated as normal (which will
be 0 as path->rows is 0).

2) Is PARALLEL_INSERT_TUP_COST_IGNORED actually needed? Couldn't only
PARALLEL_INSERT_CAN_IGN_TUP_COST be used for the purpose of ignoring
parallel tuple cost?


Regards,
Greg Nancarrow
Fujitsu Australia



pgsql-hackers by date:

Previous
From: Etsuro Fujita
Date:
Subject: Re: Asynchronous Append on postgres_fdw nodes.
Next
From: David Steele
Date:
Subject: Re: Reduce the time required for a database recovery from archive.