Re: Parallel Inserts in CREATE TABLE AS - Mailing list pgsql-hackers
From | Greg Nancarrow |
---|---|
Subject | Re: Parallel Inserts in CREATE TABLE AS |
Date | |
Msg-id | CAJcOf-eCnsmN2mcg3XVeV+0mCmoSXe4=UbS=GewHdTjqy=vmCA@mail.gmail.com Whole thread Raw |
In response to | Re: Parallel Inserts in CREATE TABLE AS (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
List | pgsql-hackers |
On Fri, Mar 19, 2021 at 4:33 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > In an offlist discussion with Robert and Dilip, using fallocate to > extend the relation may help to extend the relation faster. In regards > to this, it looks like the AIO/DIO patch set of Andres [1] which > involves using fallocate() to extend files will surely be helpful. > Until then, we honestly feel that the parallel inserts in CTAS patch > set be put on hold and revive it later. > Hi, I had partially reviewed some of the patches (first scan) when I was alerted to your post and intention to put the patch on hold. I thought I'd just post the comments I have so far, and you can look at them at a later time when/if you revive the patch. Patch 0001 1) Patch comment Leader inserts its share of tuples if instructed to do, and so are workers should be: Leader inserts its share of tuples if instructed to, and so do the workers. 2) void SetParallelInsertState(ParallelInsertCmdKind ins_cmd, QueryDesc *queryDesc) { GatherState *gstate; DestReceiver *dest; Assert(queryDesc && (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS)); gstate = (GatherState *) queryDesc->planstate; dest = queryDesc->dest; /* * Parallel insertions are not possible either if the upper node is not * Gather or it's a Gather but it have some projections to perform. */ if (!IsA(gstate, GatherState) || gstate->ps.ps_ProjInfo) return; I think it would look better for code to be: dest = queryDesc->dest; /* * Parallel insertions are not possible either if the upper node is not * Gather or it's a Gather but it have some projections to perform. */ if (!IsA(queryDesc->planstate, GatherState) || queryDesc->planstate.ps_ProjInfo) return; gstate = (GatherState *) queryDesc->planstate; 3) src/backend/executor/execParallel.c + pg_atomic_uint64 processed; I am wondering, when there is contention from multiple workers in writing back their processed count, how well does this work? Any performance issues? For the Parallel INSERT patch (which has not yet been committed) it currently uses an array of processed counts for the workers (since # of workers is capped) so there is never any contention related to this. 4) src/backend/executor/execParallel.c You shouldn't use intermingled declarations and code. https://www.postgresql.org/docs/13/source-conventions.html Best to move the uninitialized variable declaration to the top of the block: ParallelInsertCTASInfo *info = NULL; char *intoclause_str = NULL; int intoclause_len; char *intoclause_space = NULL; should be: int intoclause_len; ParallelInsertCTASInfo *info = NULL; char *intoclause_str = NULL; char *intoclause_space = NULL; 5) ExecParallelGetInsReceiver Would look better to have: DR_intorel *receiver; receiver = (DR_intorel *)CreateIntoRelDestReceiver(intoclause); receiver->is_parallel_worker = true; receiver->object_id = fpes->objectid; 6) GetParallelInsertCmdType I think the following would be better: ParallelInsertCmdKind GetParallelInsertCmdType(DestReceiver *dest) { if (dest && dest->mydest == DestIntoRel && ((DR_intorel *) dest)->is_parallel) return PARALLEL_INSERT_CMD_CREATE_TABLE_AS; return PARALLEL_INSERT_CMD_UNDEF; } 7) IsParallelInsertAllowed In the following code: /* Below check may hit in case this function is called from explain.c. */ if (!(into && IsA(into, IntoClause))) return false; If "into" is non-NULL, isn't it guaranteed to point at an IntoClause? I think the code can just be: /* Below check may hit in case this function is called from explain.c. */ if (!into) return false; 8) ExecGather The comments and variable name are likely to cause confusion when the parallel INSERT statement is implemented. Suggest minor change: change: bool perform_parallel_ins = false; to: bool perform_parallel_ins_no_readers = false; change: /* * Do not create tuple queue readers for commands with parallel * insertion. Because the gather node will not receive any * tuples, the workers will insert the tuples into the target * relation. */ to: /* * Do not create tuple queue readers for commands with parallel * insertion that don't additionally return tuples. In this case, * the workers will only insert the tuples into the target * relation and the gather node will not receive any tuples. */ I think some changes in other areas are needed for the same reasons. Patch 0002 1) I noticed that "rows" is not zero (and so is not displayed as 0 in the EXPLAIN output for Gather) for the Gather node when parallel inserts will be used. This doesn't seem to be right. I think that if PARALLEL_INSERT_CAN_IGN_TUP_COST is set, path->rows should be set to 0, and just let existing "run_cost" be evaluated as normal (which will be 0 as path->rows is 0). 2) Is PARALLEL_INSERT_TUP_COST_IGNORED actually needed? Couldn't only PARALLEL_INSERT_CAN_IGN_TUP_COST be used for the purpose of ignoring parallel tuple cost? Regards, Greg Nancarrow Fujitsu Australia
pgsql-hackers by date: