Re: initial pruning in parallel append - Mailing list pgsql-hackers
| From | Amit Langote |
|---|---|
| Subject | Re: initial pruning in parallel append |
| Date | |
| Msg-id | CA+HiwqGoTF_Vd37JJgkJCqufv7SZ+b=0kbw3Ee=Dj7BH_e2mPQ@mail.gmail.com |
| In response to | Re: initial pruning in parallel append (Robert Haas <robertmhaas@gmail.com>) |
| Responses | Re: initial pruning in parallel append |
| List | pgsql-hackers |
On Tue, Aug 8, 2023 at 12:53 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Aug 7, 2023 at 10:25 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Note we're talking here about "initial" pruning that occurs during
> > ExecInitNode().  Workers are only launched during ExecGather[Merge](),
> > which thereafter do ExecInitNode() on their copy of the plan tree.  So
> > if we are to pass the pruning results for cross-checking, it will have
> > to be from the leader to the workers.
>
> That doesn't seem like a big problem because there aren't many node
> types that do pruning, right? I think we're just talking about Append
> and MergeAppend, or something like that, right?

MergeAppend can't be parallel-aware at the moment, so only Append.

> You just need the
> ExecWhateverEstimate function to budget some DSM space to store the
> information, which can basically just be a bitmap over the set of
> child plans, and the ExecWhateverInitializeDSM copies the information
> into that DSM space, and ExecWhateverInitializeWorker() copies the
> information from the shared space back into the local node (or maybe
> just points to it, if the representation is sufficiently compatible).
> I feel like this is an hour or two's worth of coding, unless I'm
> missing something, and WAY easier than trying to reason about what
> happens if expression evaluation isn't as stable as we'd like it to
> be.

OK, I agree that we'd better share the pruning result between the
leader and workers.

I hadn't thought about putting the pruning result into Append's DSM
(ParallelAppendState), which is what you're describing, IIUC.  I looked
into it, though I'm not sure it can be made to work given the way
things are on the worker side, or at least not without some
reshuffling of code in ParallelQueryMain().  The pruning result would
have to be available in ExecInitAppend(), but because the worker reads
the DSM only after finishing the plan tree initialization, it won't
be.  Perhaps we could integrate ExecParallelInitializeWorker()'s
responsibilities into ExecutorStart() / ExecInitNode() somehow?

That is, change the ordering of the following code in ParallelQueryMain():

    /* Start up the executor */
    queryDesc->plannedstmt->jitFlags = fpes->jit_flags;
    ExecutorStart(queryDesc, fpes->eflags);

    /* Special executor initialization steps for parallel workers */
    queryDesc->planstate->state->es_query_dsa = area;
    if (DsaPointerIsValid(fpes->param_exec))
    {
        char       *paramexec_space;

        paramexec_space = dsa_get_address(area, fpes->param_exec);
        RestoreParamExecParams(paramexec_space, queryDesc->estate);
    }
    pwcxt.toc = toc;
    pwcxt.seg = seg;
    ExecParallelInitializeWorker(queryDesc->planstate, &pwcxt);

Looking inside ExecParallelInitializeWorker():

    static bool
    ExecParallelInitializeWorker(PlanState *planstate,
                                 ParallelWorkerContext *pwcxt)
    {
        if (planstate == NULL)
            return false;

        switch (nodeTag(planstate))
        {
            case T_SeqScanState:
                if (planstate->plan->parallel_aware)
                    ExecSeqScanInitializeWorker((SeqScanState *) planstate,
                                                pwcxt);

I guess that would mean putting the if (planstate->plan->parallel_aware)
block seen here at the end of ExecInitSeqScan(), and so on for the
other node types.

Or we could consider something like the patch I mentioned in my 1st
email.  The idea there was to pass the pruning result via a separate
channel, not the DSM chunk linked into the PlanState tree.  To wit, on
the leader side, ExecInitParallelPlan() puts the serialized
List-of-Bitmapset into the shm_toc with a dedicated PARALLEL_KEY,
alongside PlannedStmt, ParamListInfo, etc.  The List-of-Bitmapset is
initialized during the leader's ExecInitNode().
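For illustration, the leader-side step could look something like the
sketch below -- untested, PARALLEL_KEY_PART_PRUNE_RESULTS is just a
made-up name for the dedicated key, and nodeToString() stands in for
however the List-of-Bitmapset actually gets flattened.  It mirrors the
way ExecInitParallelPlan() already ships the serialized PlannedStmt:

    /*
     * Sketch only: in ExecInitParallelPlan(), flatten the pruning results
     * computed during the leader's ExecInitNode() and stash them in the
     * shm_toc, the same way PARALLEL_KEY_PLANNEDSTMT is handled.
     * es_part_prune_results is the proposed List of per-Append Bitmapsets.
     */
    char       *prune_results_data;
    char       *prune_results_space;

    prune_results_data = nodeToString(estate->es_part_prune_results);

    /* Estimate phase: reserve room for the flat string and its toc key. */
    shm_toc_estimate_chunk(&pcxt->estimator, strlen(prune_results_data) + 1);
    shm_toc_estimate_keys(&pcxt->estimator, 1);

    ...

    /* After InitializeParallelDSM(): copy the string in and register it. */
    prune_results_space =
        shm_toc_allocate(pcxt->toc, strlen(prune_results_data) + 1);
    memcpy(prune_results_space, prune_results_data,
           strlen(prune_results_data) + 1);
    shm_toc_insert(pcxt->toc, PARALLEL_KEY_PART_PRUNE_RESULTS,
                   prune_results_space);

A worker would do the reverse with shm_toc_lookup() + stringToNode(),
as described below.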
On the worker side, ExecParallelGetQueryDesc() reads the
List-of-Bitmapset string and puts the resulting node into the
QueryDesc, which ParallelQueryMain() then uses to do ExecutorStart(),
which copies the pointer into EState.es_part_prune_results.
ExecInitAppend() consults EState.es_part_prune_results and uses the
Bitmapset from there, if present, instead of performing initial
pruning.  I'm assuming it's not too ugly if ExecInitAppend() uses
IsParallelWorker() to decide whether it should be writing to
EState.es_part_prune_results or reading from it -- the former in the
leader and the latter in a worker.

If we are to go with this approach, we will need to un-revert
ec386948948c, which moved the PartitionPruneInfo nodes out of
Append/MergeAppend nodes into a List in PlannedStmt (copied into
EState.es_part_prune_infos), such that es_part_prune_results mirrors
es_part_prune_infos.

Thoughts?

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com