Re: Parallel Inserts in CREATE TABLE AS - Mailing list pgsql-hackers
From | Luc Vlaming |
---|---|
Subject | Re: Parallel Inserts in CREATE TABLE AS |
Date | |
Msg-id | af416701-d8b2-7a2e-22aa-a7b7f4179e68@swarm64.com Whole thread Raw |
In response to | Re: Parallel Inserts in CREATE TABLE AS (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
List | pgsql-hackers |
On 06-01-2021 09:32, Bharath Rupireddy wrote: > On Tue, Jan 5, 2021 at 1:25 PM Luc Vlaming <luc@swarm64.com> wrote: >>>>>> wrt v18-0002....patch: >>>>>> >>>>>> It looks like this introduces a state machine that goes like: >>>>>> - starts at CTAS_PARALLEL_INS_UNDEF >>>>>> - possibly moves to CTAS_PARALLEL_INS_SELECT >>>>>> - CTAS_PARALLEL_INS_TUP_COST_CAN_IGN can be added >>>>>> - if both were added at some stage, we can go to >>>>>> CTAS_PARALLEL_INS_TUP_COST_IGNORED and ignore the costs >>>>>> >>>>>> what i'm wondering is why you opted to put logic around >>>>>> generate_useful_gather_paths and in cost_gather when to me it seems more >>>>>> logical to put it in create_gather_path? i'm probably missing something >>>>>> there? >>>>> >>>>> IMO, The reason is we want to make sure we only ignore the cost when Gather is the top node. >>>>> And it seems the generate_useful_gather_paths called in apply_scanjoin_target_to_paths is the right place which canonly create top node Gather. >>>>> So we change the flag in apply_scanjoin_target_to_paths around generate_useful_gather_paths to identify the top node. >>> >>> Right. We wanted to ignore parallel tuple cost for only the upper Gather path. >>> >>>> I was wondering actually if we need the state machine. Reason is that as >>>> AFAICS the code could be placed in create_gather_path, where you can >>>> also check if it is a top gather node, whether the dest receiver is the >>>> right type, etc? To me that seems like a nicer solution as its makes >>>> that all logic that decides whether or not a parallel CTAS is valid is >>>> in a single place instead of distributed over various places. >>> >>> IMO, we can't determine the fact that we are going to generate the top >>> Gather path in create_gather_path. To decide on whether or not the top >>> Gather path generation, I think it's not only required to check the >>> root->query_level == 1 but we also need to rely on from where >>> generate_useful_gather_paths gets called. For instance, for >>> query_level 1, generate_useful_gather_paths gets called from 2 places >>> in apply_scanjoin_target_to_paths. Likewise, create_gather_path also >>> gets called from many places. IMO, the current way i.e. setting flag >>> it in apply_scanjoin_target_to_paths and ignoring based on that in >>> cost_gather seems safe. >>> >>> I may be wrong. Thoughts? >> >> So the way I understand it the requirements are: >> - it needs to be the top-most gather >> - it should not do anything with the rows after the gather node as this >> would make the parallel inserts conceptually invalid. > > Right. > >> Right now we're trying to judge what might be added on-top that could >> change the rows by inspecting all parts of the root object that would >> cause anything to be added, and add a little statemachine to track the >> state of that knowledge. To me this has the downside that the list in >> HAS_PARENT_PATH_GENERATING_CLAUSE has to be exhaustive, and we need to >> make sure it stays up-to-date, which could result in regressions if not >> tracked carefully. > > Right. Any new clause that will be added which generates an upper path > in grouping_planner after apply_scanjoin_target_to_paths also needs to > be added to HAS_PARENT_PATH_GENERATING_CLAUSE. Otherwise, we might > ignore the parallel tuple cost because of which the parallel plan may > be chosen and we go for parallel inserts only when the top node is > Gather. I don't think any new clause that will be added generates a > new upper Gather node in grouping_planner after > apply_scanjoin_target_to_paths. > >> Personally I would therefore go for a design which is safe in the sense >> that regressions are not as easily introduced. IMHO that could be done >> by inspecting the planned query afterwards, and then judging whether or >> not the parallel inserts are actually the right thing to do. > > The 0001 patch does that. It doesn't have any influence on the planner > for parallel tuple cost calculation, it just looks at the generated > plan and decides on parallel inserts. Having said that, we might miss > parallel plans even though we know that there will not be tuples > transferred from workers to Gather. So, 0002 patch adds the code for > influencing the planner for parallel tuple cost. > Ok. Thanks for the explanation and sorry for the confusion. >> Another way to create more safety against regressions would be to add an >> assert upon execution of the query that if we do parallel inserts that >> only a subset of allowed nodes exists above the gather node. > > Yes, we already do this. Please have a look at > SetParallelInsertState() in the 0002 patch. The idea is that in any > case, if the planner ignored the tuple cost, but we later not allow > parallel inserts either due to the upper node is not Gather or Gather > with projections. The assertion fails. So, in case any new parent path > generating clause is added (apart from the ones that are there in > HAS_PARENT_PATH_GENERATING_CLAUSE) and we ignore the tuple cost, then > this Assert will catch it. Currently, I couldn't find any assertion > failures in my debug build with make check and make check world. > Ok. Seems I missed that assert when reviewing. > + else > + { > + /* > + * Upper Gather node has projections, so parallel insertions are not > + * allowed. > + */ > + if (ins_cmd == PARALLEL_INSERT_CMD_CREATE_TABLE_AS) > + ((DR_intorel *) dest)->is_parallel = false; > + > + gstate->dest = NULL; > + > + /* > + * Before returning, ensure that we have not done wrong parallel tuple > + * cost enforcement in the planner. Main reason for this assertion is > + * to check if we enforced the planner to ignore the parallel tuple > + * cost (with the intention of choosing parallel inserts) due to which > + * the parallel plan may have been chosen, but we do not allow the > + * parallel inserts now. > + * > + * If we have correctly ignored parallel tuple cost in the planner > + * while creating Gather path, then this assertion failure should not > + * occur. In case it occurs, that means the planner may have chosen > + * this parallel plan because of our wrong enforcement. So let's try to > + * catch that here. > + */ > + Assert(tuple_cost_opts && !(*tuple_cost_opts & > + PARALLEL_INSERT_TUP_COST_IGNORED)); > + } > > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com > Kind regards, Luc
pgsql-hackers by date: