Thread: generic plans and "initial" pruning
Executing generic plans involving partitions is known to become slower as partition count grows due to a number of bottlenecks, with AcquireExecutorLocks() showing at the top in profiles.

Previous attempt at solving that problem was by David Rowley [1], where he proposed delaying locking of *all* partitions appearing under an Append/MergeAppend until "initial" pruning is done during the executor initialization phase. A problem with that approach that he has described in [2] is that leaving partitions unlocked can lead to race conditions where the Plan node belonging to a partition can be invalidated when a concurrent session successfully alters the partition between AcquireExecutorLocks() saying the plan is okay to execute and then actually executing it.

However, using an idea that Robert suggested to me off-list a little while back, it seems possible to determine the set of partitions that we can safely skip locking. The idea is to look at the "initial" or "pre-execution" pruning instructions contained in a given Append or MergeAppend node when AcquireExecutorLocks() is collecting the relations to lock and consider relations from only those sub-nodes that survive performing those instructions. I've attempted implementing that idea in the attached patch.

Note that "initial" pruning steps are now performed twice when executing generic plans: once in AcquireExecutorLocks() to find partitions to be locked, and a 2nd time in ExecInit[Merge]Append() to determine the set of partition sub-nodes to be initialized for execution, though I wasn't able to come up with a good idea to avoid this duplication.

Using the following benchmark setup:

pgbench testdb -i --partitions=$nparts > /dev/null 2>&1
pgbench -n testdb -S -T 30 -Mprepared

And plan_cache_mode = force_generic_plan, I get following numbers:

HEAD:

32    tps = 20561.776403 (without initial connection time)
64    tps = 12553.131423 (without initial connection time)
128   tps = 13330.365696 (without initial connection time)
256   tps = 8605.723120 (without initial connection time)
512   tps = 4435.951139 (without initial connection time)
1024  tps = 2346.902973 (without initial connection time)
2048  tps = 1334.680971 (without initial connection time)

Patched:

32    tps = 27554.156077 (without initial connection time)
64    tps = 27531.161310 (without initial connection time)
128   tps = 27138.305677 (without initial connection time)
256   tps = 25825.467724 (without initial connection time)
512   tps = 19864.386305 (without initial connection time)
1024  tps = 18742.668944 (without initial connection time)
2048  tps = 16312.412704 (without initial connection time)

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CAKJS1f_kfRQ3ZpjQyHC7=PK9vrhxiHBQFZ+hc0JCwwnRKkF3hg@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKJS1f99JNe%2Bsw5E3qWmS%2BHeLMFaAhehKO67J1Ym3pXv0XBsxw%40mail.gmail.com
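(For anyone reproducing this: with -M prepared, the -S select-only workload boils down to repeatedly executing, as a prepared statement, a query of roughly the following shape, where :aid is a pgbench variable chosen at random per transaction and pgbench -i --partitions range-partitions pgbench_accounts on aid. The exact script text may differ across pgbench versions.)

    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

With plan_cache_mode = force_generic_plan, every execution of that statement goes through CheckCachedPlan() / AcquireExecutorLocks() over the full set of partitions, which is the bottleneck being measured above.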
Attachment
On Sat, Dec 25, 2021 at 9:06 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Executing generic plans involving partitions is known to become slower as partition count grows due to a number of bottlenecks, with AcquireExecutorLocks() showing at the top in profiles.
>
> Previous attempt at solving that problem was by David Rowley [1], where he proposed delaying locking of *all* partitions appearing under an Append/MergeAppend until "initial" pruning is done during the executor initialization phase. A problem with that approach that he has described in [2] is that leaving partitions unlocked can lead to race conditions where the Plan node belonging to a partition can be invalidated when a concurrent session successfully alters the partition between AcquireExecutorLocks() saying the plan is okay to execute and then actually executing it.
>
> However, using an idea that Robert suggested to me off-list a little while back, it seems possible to determine the set of partitions that we can safely skip locking. The idea is to look at the "initial" or "pre-execution" pruning instructions contained in a given Append or MergeAppend node when AcquireExecutorLocks() is collecting the relations to lock and consider relations from only those sub-nodes that survive performing those instructions. I've attempted implementing that idea in the attached patch.

In which cases, we will have "pre-execution" pruning instructions that can be used to skip locking partitions? Can you please give a few examples where this approach will be useful?

The benchmark is showing good results, indeed.

--
Best Wishes,
Ashutosh Bapat
On Tue, Dec 28, 2021 at 22:12 Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> On Sat, Dec 25, 2021 at 9:06 AM Amit Langote <amitlangote09@gmail.com> wrote:
> >
> > Executing generic plans involving partitions is known to become slower
> > as partition count grows due to a number of bottlenecks, with
> > AcquireExecutorLocks() showing at the top in profiles.
> >
> > Previous attempt at solving that problem was by David Rowley [1],
> > where he proposed delaying locking of *all* partitions appearing under
> > an Append/MergeAppend until "initial" pruning is done during the
> > executor initialization phase. A problem with that approach that he
> > has described in [2] is that leaving partitions unlocked can lead to
> > race conditions where the Plan node belonging to a partition can be
> > invalidated when a concurrent session successfully alters the
> > partition between AcquireExecutorLocks() saying the plan is okay to
> > execute and then actually executing it.
> >
> > However, using an idea that Robert suggested to me off-list a little
> > while back, it seems possible to determine the set of partitions that
> > we can safely skip locking. The idea is to look at the "initial" or
> > "pre-execution" pruning instructions contained in a given Append or
> > MergeAppend node when AcquireExecutorLocks() is collecting the
> > relations to lock and consider relations from only those sub-nodes
> > that survive performing those instructions. I've attempted
> > implementing that idea in the attached patch.
> >
> In which cases, we will have "pre-execution" pruning instructions that
> can be used to skip locking partitions? Can you please give a few
> examples where this approach will be useful?
This is mainly to be useful for prepared queries, so something like:
prepare q as select * from partitioned_table where key = $1;
And that too when execute q(…) uses a generic plan. Generic plans are problematic because they must contain nodes for all partitions (without any plan-time pruning), which means CheckCachedPlan() has to spend time proportional to the number of partitions to determine that the plan is still usable / has not been invalidated; most of that time is spent in AcquireExecutorLocks().
Other bottlenecks, not addressed in this patch, pertain to some executor startup/shutdown subroutines that process the range table of a PlannedStmt in its entirety, whose length is also proportional to the number of partitions when the plan is generic.
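To make that concrete, here is a minimal, self-contained example; the table, partition, and statement names are only illustrative, and the partition count is kept tiny:

    create table partitioned_table (key int, val text) partition by hash (key);
    create table partitioned_table_p0 partition of partitioned_table for values with (modulus 4, remainder 0);
    create table partitioned_table_p1 partition of partitioned_table for values with (modulus 4, remainder 1);
    create table partitioned_table_p2 partition of partitioned_table for values with (modulus 4, remainder 2);
    create table partitioned_table_p3 partition of partitioned_table for values with (modulus 4, remainder 3);

    set plan_cache_mode = force_generic_plan;
    prepare q as select * from partitioned_table where key = $1;

    -- The generic plan contains an Append over all 4 partitions; "initial"
    -- pruning at executor startup discards all but one of them (EXPLAIN
    -- should report something like "Subplans Removed: 3"), yet
    -- AcquireExecutorLocks() currently still locks every partition.
    explain (costs off) execute q(1);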
> The benchmark is showing good results, indeed.
Thanks.
Amit Langote
EDB: http://www.enterprisedb.com
On Fri, Dec 31, 2021 at 7:56 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Tue, Dec 28, 2021 at 22:12 Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>>
>> On Sat, Dec 25, 2021 at 9:06 AM Amit Langote <amitlangote09@gmail.com> wrote:
>> >
>> > Executing generic plans involving partitions is known to become slower as partition count grows due to a number of bottlenecks, with AcquireExecutorLocks() showing at the top in profiles.
>> >
>> > Previous attempt at solving that problem was by David Rowley [1], where he proposed delaying locking of *all* partitions appearing under an Append/MergeAppend until "initial" pruning is done during the executor initialization phase. A problem with that approach that he has described in [2] is that leaving partitions unlocked can lead to race conditions where the Plan node belonging to a partition can be invalidated when a concurrent session successfully alters the partition between AcquireExecutorLocks() saying the plan is okay to execute and then actually executing it.
>> >
>> > However, using an idea that Robert suggested to me off-list a little while back, it seems possible to determine the set of partitions that we can safely skip locking. The idea is to look at the "initial" or "pre-execution" pruning instructions contained in a given Append or MergeAppend node when AcquireExecutorLocks() is collecting the relations to lock and consider relations from only those sub-nodes that survive performing those instructions. I've attempted implementing that idea in the attached patch.
>> >
>>
>> In which cases, we will have "pre-execution" pruning instructions that can be used to skip locking partitions? Can you please give a few examples where this approach will be useful?
>
> This is mainly to be useful for prepared queries, so something like:
>
> prepare q as select * from partitioned_table where key = $1;
>
> And that too when execute q(…) uses a generic plan. Generic plans are problematic because they must contain nodes for all partitions (without any plan-time pruning), which means CheckCachedPlan() has to spend time proportional to the number of partitions to determine that the plan is still usable / has not been invalidated; most of that time is spent in AcquireExecutorLocks().
>
> Other bottlenecks, not addressed in this patch, pertain to some executor startup/shutdown subroutines that process the range table of a PlannedStmt in its entirety, whose length is also proportional to the number of partitions when the plan is generic.
>
>> The benchmark is showing good results, indeed.
>

Indeed.

Here are few comments for v1 patch:

+ /* Caller error if we get here without contains_init_steps */
+ Assert(pruneinfo->contains_init_steps);


- prunedata = prunestate->partprunedata[i];
- pprune = &prunedata->partrelprunedata[0];

- /* Perform pruning without using PARAM_EXEC Params */
- find_matching_subplans_recurse(prunedata, pprune, true, &result);
+ if (parentrelids)
+     *parentrelids = NULL;

You got two blank lines after Assert.

--

+ /* Set up EState if not in the executor proper. */
+ if (estate == NULL)
+ {
+     estate = CreateExecutorState();
+     estate->es_param_list_info = params;
+     free_estate = true;
  }

... [Skip]

+ if (free_estate)
+ {
+     FreeExecutorState(estate);
+     estate = NULL;
  }

I think this work should be left to the caller.

--

/*
 * Stuff that follows matches exactly what ExecCreatePartitionPruneState()
 * does, except we don't need a PartitionPruneState here, so don't call
 * that function.
 *
 * XXX some refactoring might be good.
 */

+1, while doing it would be nice if foreach_current_index() is used instead of the i & j sequence in the respective foreach() block, IMO.

--

+ while ((i = bms_next_member(validsubplans, i)) >= 0)
+ {
+     Plan *subplan = list_nth(subplans, i);
+
+     context->relations =
+         bms_add_members(context->relations,
+                         get_plan_scanrelids(subplan));
+ }

I think instead of get_plan_scanrelids() the GetLockableRelations_worker() can be used; if so, then no need to add get_plan_scanrelids() function.

--

  /* Nodes containing prunable subnodes. */
+ case T_MergeAppend:
+     {
+         PlannedStmt *plannedstmt = context->plannedstmt;
+         List *rtable = plannedstmt->rtable;
+         ParamListInfo params = context->params;
+         PartitionPruneInfo *pruneinfo;
+         Bitmapset *validsubplans;
+         Bitmapset *parentrelids;
...
          if (pruneinfo && pruneinfo->contains_init_steps)
          {
              int i;
...
              return false;
          }
      }
      break;

Most of the declarations need to be moved inside the if-block.

Also, initially, I was a bit concerned regarding this code block inside GetLockableRelations_worker(), what if (pruneinfo && pruneinfo->contains_init_steps) evaluated to false? After debugging I realized that plan_tree_walker() will do the needful -- a bit of comment would have helped.

--

+ case T_CustomScan:
+     foreach(lc, ((CustomScan *) plan)->custom_plans)
+     {
+         if (walker((Plan *) lfirst(lc), context))
+             return true;
+     }
+     break;

Why not plan_walk_members() call like other nodes?

Regards,
Amul
On Fri, Dec 24, 2021 at 10:36 PM Amit Langote <amitlangote09@gmail.com> wrote: > However, using an idea that Robert suggested to me off-list a little > while back, it seems possible to determine the set of partitions that > we can safely skip locking. The idea is to look at the "initial" or > "pre-execution" pruning instructions contained in a given Append or > MergeAppend node when AcquireExecutorLocks() is collecting the > relations to lock and consider relations from only those sub-nodes > that survive performing those instructions. I've attempted > implementing that idea in the attached patch. Hmm. The first question that occurs to me is whether this is fully safe. Currently, AcquireExecutorLocks calls LockRelationOid for every relation involved in the query. That means we will probably lock at least one relation on which we previously had no lock and thus AcceptInvalidationMessages(). That will end up marking the query as no longer valid and CheckCachedPlan() will realize this and tell the caller to replan. In the corner case where we already hold all the required locks, we will not accept invalidation messages at this point, but must have done so after acquiring the last of the locks required, and if that didn't mark the plan invalid, it can't be invalid now either. Either way, everything is fine. With the proposed patch, we might never lock some of the relations involved in the query. Therefore, if one of those relations has been modified in some way that would invalidate the plan, we will potentially fail to discover this, and will use the plan anyway. For instance, suppose there's one particular partition that has an extra index and the plan involves an Index Scan using that index. Now suppose that the scan of the partition in question is pruned, but meanwhile, the index has been dropped. Now we're running a plan that scans a nonexistent index. Admittedly, we're not running that part of the plan. But is that enough for this to be safe? There are things (like EXPLAIN or auto_explain) that we might try to do even on a part of the plan tree that we don't try to run. Those things might break, because for example we won't be able to look up the name of an index in the catalogs for EXPLAIN output if the index is gone. This is just a relatively simple example and I think there are probably a bunch of others. There are a lot of kinds of DDL that could be performed on a partition that gets pruned away: DROP INDEX is just one example. The point is that to my knowledge we have no existing case where we try to use a plan that might be only partly valid, so if we introduce one, there's some risk there. I thought for a while, too, about whether changes to some object in a part of the plan that we're not executing could break things for the rest of the plan even if we never do anything with the plan but execute it. I can't quite see any actual hazard. For example, I thought about whether we might try to get the tuple descriptor for the pruned-away object and get a different tuple descriptor than we were expecting. I think we can't, because (1) the pruned object has to be a partition, and tuple descriptors have to match throughout the partitioning hierarchy, except for column ordering, which currently can't be changed after-the-fact and (2) IIRC, the tuple descriptor is stored in the plan and not reconstructed at runtime and (3) if we don't end up opening the relation because it's pruned, then we certainly can't do anything with its tuple descriptor. 
But it might be worth giving more thought to the question of whether there's any other way we could be depending on the details of an object that ended up getting pruned. > Note that "initial" pruning steps are now performed twice when > executing generic plans: once in AcquireExecutorLocks() to find > partitions to be locked, and a 2nd time in ExecInit[Merge]Append() to > determine the set of partition sub-nodes to be initialized for > execution, though I wasn't able to come up with a good idea to avoid > this duplication. I think this is something that will need to be fixed somehow. Apart from the CPU cost, it's scary to imagine that the set of nodes on which we acquired locks might be different from the set of nodes that we initialize. If we do the same computation twice, there must be some non-zero probability of getting a different answer the second time, even if the circumstances under which it would actually happen are remote. Consider, for example, a function that is labeled IMMUTABLE but is really VOLATILE. Now maybe you can get the system to lock one set of partitions and then initialize a different set of partitions. I don't think we want to try to reason about what consequences that might have and prove that somehow it's going to be OK; I think we want to nail the door shut very tightly to make sure that it can't. -- Robert Haas EDB: http://www.enterprisedb.com
Thanks for taking the time to look at this. On Wed, Jan 12, 2022 at 1:22 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Dec 24, 2021 at 10:36 PM Amit Langote <amitlangote09@gmail.com> wrote: > > However, using an idea that Robert suggested to me off-list a little > > while back, it seems possible to determine the set of partitions that > > we can safely skip locking. The idea is to look at the "initial" or > > "pre-execution" pruning instructions contained in a given Append or > > MergeAppend node when AcquireExecutorLocks() is collecting the > > relations to lock and consider relations from only those sub-nodes > > that survive performing those instructions. I've attempted > > implementing that idea in the attached patch. > > Hmm. The first question that occurs to me is whether this is fully safe. > > Currently, AcquireExecutorLocks calls LockRelationOid for every > relation involved in the query. That means we will probably lock at > least one relation on which we previously had no lock and thus > AcceptInvalidationMessages(). That will end up marking the query as no > longer valid and CheckCachedPlan() will realize this and tell the > caller to replan. In the corner case where we already hold all the > required locks, we will not accept invalidation messages at this > point, but must have done so after acquiring the last of the locks > required, and if that didn't mark the plan invalid, it can't be > invalid now either. Either way, everything is fine. > > With the proposed patch, we might never lock some of the relations > involved in the query. Therefore, if one of those relations has been > modified in some way that would invalidate the plan, we will > potentially fail to discover this, and will use the plan anyway. For > instance, suppose there's one particular partition that has an extra > index and the plan involves an Index Scan using that index. Now > suppose that the scan of the partition in question is pruned, but > meanwhile, the index has been dropped. Now we're running a plan that > scans a nonexistent index. Admittedly, we're not running that part of > the plan. But is that enough for this to be safe? There are things > (like EXPLAIN or auto_explain) that we might try to do even on a part > of the plan tree that we don't try to run. Those things might break, > because for example we won't be able to look up the name of an index > in the catalogs for EXPLAIN output if the index is gone. > > This is just a relatively simple example and I think there are > probably a bunch of others. There are a lot of kinds of DDL that could > be performed on a partition that gets pruned away: DROP INDEX is just > one example. The point is that to my knowledge we have no existing > case where we try to use a plan that might be only partly valid, so if > we introduce one, there's some risk there. I thought for a while, too, > about whether changes to some object in a part of the plan that we're > not executing could break things for the rest of the plan even if we > never do anything with the plan but execute it. I can't quite see any > actual hazard. For example, I thought about whether we might try to > get the tuple descriptor for the pruned-away object and get a > different tuple descriptor than we were expecting. 
I think we can't, > because (1) the pruned object has to be a partition, and tuple > descriptors have to match throughout the partitioning hierarchy, > except for column ordering, which currently can't be changed > after-the-fact and (2) IIRC, the tuple descriptor is stored in the > plan and not reconstructed at runtime and (3) if we don't end up > opening the relation because it's pruned, then we certainly can't do > anything with its tuple descriptor. But it might be worth giving more > thought to the question of whether there's any other way we could be > depending on the details of an object that ended up getting pruned. I have pondered on the possible hazards before writing the patch, mainly because the concerns about a previously discussed proposal were along similar lines [1]. IIUC, you're saying the plan tree is subject to inspection by non-core code before ExecutorStart() has initialized a PlanState tree, which must have discarded pruned portions of the plan tree. I wouldn't claim to have scanned *all* of the core code that could possibly access the invalidated portions of the plan tree, but from what I have seen, I couldn't find any site that does. An ExecutorStart_hook() gets to do that, but from what I can see it is expected to call standard_ExecutorStart() before doing its thing and supposedly only looks at the PlanState tree, which must be valid. Actually, EXPLAIN also does ExecutorStart() before starting to look at the plan (the PlanState tree), so must not run into pruned plan tree nodes. All that said, it does sound like wishful thinking to say that no problems can possibly occur. At first, I had tried to implement this such that the Append/MergeAppend nodes are edited to record the result of initial pruning, but it felt wrong to be munging the plan tree in plancache.c. Or, maybe this won't be a concern if performing ExecutorStart() is made a part of CheckCachedPlan() somehow, which would then take locks on the relation as the PlanState tree is built capturing any plan invalidations, instead of AcquireExecutorLocks(). That does sound like an ambitious undertaking though. > > Note that "initial" pruning steps are now performed twice when > > executing generic plans: once in AcquireExecutorLocks() to find > > partitions to be locked, and a 2nd time in ExecInit[Merge]Append() to > > determine the set of partition sub-nodes to be initialized for > > execution, though I wasn't able to come up with a good idea to avoid > > this duplication. > > I think this is something that will need to be fixed somehow. Apart > from the CPU cost, it's scary to imagine that the set of nodes on > which we acquired locks might be different from the set of nodes that > we initialize. If we do the same computation twice, there must be some > non-zero probability of getting a different answer the second time, > even if the circumstances under which it would actually happen are > remote. Consider, for example, a function that is labeled IMMUTABLE > but is really VOLATILE. Now maybe you can get the system to lock one > set of partitions and then initialize a different set of partitions. I > don't think we want to try to reason about what consequences that > might have and prove that somehow it's going to be OK; I think we want > to nail the door shut very tightly to make sure that it can't. Yeah, the premise of the patch is that "initial" pruning steps produce the same result both times. I assumed that would be true because the pruning steps are not allowed to contain any VOLATILE expressions. 
Regarding the possibility that IMMUTABLE labeling of functions may be incorrect, I haven't considered if the runtime pruning code can cope or whether it should try to. If such a case does occur in practice, the bad outcome would be an Assert failure in ExecGetRangeTableRelation() or using a partition unlocked in the non-assert builds, the latter of which feels especially bad. -- Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/CA%2BTgmoZN-80143F8OhN8Cn5-uDae5miLYVwMapAuc%2B7%2BZ7pyNg%40mail.gmail.com
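To illustrate the kind of mislabeling being discussed here, consider a function like the following -- a purely artificial example, not something from the patch -- used against a partitioned table such as the one in the earlier example:

    -- Declared STABLE, so "initial" pruning is allowed to evaluate it
    -- before execution, but its result actually changes from call to call.
    create function not_really_stable(p int) returns int
    language sql stable
    as $$ select p + (random() * 10)::int $$;

    prepare q2 as select * from partitioned_table where key = not_really_stable($1);

    -- Two evaluations of the pruning expression -- one in
    -- AcquireExecutorLocks(), one in ExecInit[Merge]Append() -- could now
    -- select different partitions.
    execute q2(1);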
On Wed, Jan 12, 2022 at 9:32 AM Amit Langote <amitlangote09@gmail.com> wrote: > I have pondered on the possible hazards before writing the patch, > mainly because the concerns about a previously discussed proposal were > along similar lines [1]. True. I think that the hazards are narrower with this proposal, because if you *delay* locking a partition that you eventually need, then you might end up trying to actually execute a portion of the plan that's no longer valid. That seems like hopelessly bad news. On the other hand, with this proposal, you skip locking altogether, but only for parts of the plan that you don't plan to execute. That's still kind of scary, but not to nearly the same degree. > IIUC, you're saying the plan tree is subject to inspection by non-core > code before ExecutorStart() has initialized a PlanState tree, which > must have discarded pruned portions of the plan tree. I wouldn't > claim to have scanned *all* of the core code that could possibly > access the invalidated portions of the plan tree, but from what I have > seen, I couldn't find any site that does. An ExecutorStart_hook() > gets to do that, but from what I can see it is expected to call > standard_ExecutorStart() before doing its thing and supposedly only > looks at the PlanState tree, which must be valid. Actually, EXPLAIN > also does ExecutorStart() before starting to look at the plan (the > PlanState tree), so must not run into pruned plan tree nodes. All > that said, it does sound like wishful thinking to say that no problems > can possibly occur. Yeah. I don't think it's only non-core code we need to worry about either. What if I just do EXPLAIN ANALYZE on a prepared query that ends up pruning away some stuff? IIRC, the pruned subplans are not shown, so we might escape disaster here, but FWIW if I'd committed that code I would have pushed hard for showing those and saying "(not executed)" .... so it's not too crazy to imagine a world in which things work that way. > At first, I had tried to implement this such that the > Append/MergeAppend nodes are edited to record the result of initial > pruning, but it felt wrong to be munging the plan tree in plancache.c. It is. You can't munge the plan tree: it's required to be strictly read-only once generated. It can be serialized and deserialized for transmission to workers, and it can be shared across executions. > Or, maybe this won't be a concern if performing ExecutorStart() is > made a part of CheckCachedPlan() somehow, which would then take locks > on the relation as the PlanState tree is built capturing any plan > invalidations, instead of AcquireExecutorLocks(). That does sound like > an ambitious undertaking though. On the surface that would seem to involve abstraction violations, but maybe that could be finessed somehow. The plancache shouldn't know too much about what the executor is going to do with the plan, but it could ask the executor to perform a step that has been designed for use by the plancache. I guess the core problem here is how to pass around information that is node-specific before we've stood up the executor state tree. Maybe the executor could have a function that does the pruning and returns some kind of array of results that can be used both to decide what to lock and also what to consider as pruned at the start of execution. (I'm hand-waving about the details because I don't know.) > Yeah, the premise of the patch is that "initial" pruning steps produce > the same result both times. 
I assumed that would be true because the > pruning steps are not allowed to contain any VOLATILE expressions. > Regarding the possibility that IMMUTABLE labeling of functions may be > incorrect, I haven't considered if the runtime pruning code can cope > or whether it should try to. If such a case does occur in practice, > the bad outcome would be an Assert failure in > ExecGetRangeTableRelation() or using a partition unlocked in the > non-assert builds, the latter of which feels especially bad. Right. I think it's OK for a query to produce wrong answers under those kinds of conditions - the user has broken everything and gets to keep all the pieces - but doing stuff that might violate fundamental assumptions of the system like "relations can only be accessed when holding a lock on them" feels quite bad. It's not a stretch to imagine that failing to follow those invariants could take the whole system down, which is clearly too severe a consequence for the user's failure to label things properly. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 6, 2022 at 3:45 PM Amul Sul <sulamul@gmail.com> wrote:
> Here are few comments for v1 patch:

Thanks Amul. I'm thinking about Robert's latest comments, addressing which may need some rethinking of this whole design, but I decided to post a v2 taking care of your comments.

> + /* Caller error if we get here without contains_init_steps */
> + Assert(pruneinfo->contains_init_steps);
>
> - prunedata = prunestate->partprunedata[i];
> - pprune = &prunedata->partrelprunedata[0];
>
> - /* Perform pruning without using PARAM_EXEC Params */
> - find_matching_subplans_recurse(prunedata, pprune, true, &result);
> + if (parentrelids)
> + *parentrelids = NULL;
>
> You got two blank lines after Assert.

Fixed.

> --
>
> + /* Set up EState if not in the executor proper. */
> + if (estate == NULL)
> + {
> + estate = CreateExecutorState();
> + estate->es_param_list_info = params;
> + free_estate = true;
> }
>
> ... [Skip]
>
> + if (free_estate)
> + {
> + FreeExecutorState(estate);
> + estate = NULL;
> }
>
> I think this work should be left to the caller.

Done. Also see below...

> /*
> * Stuff that follows matches exactly what ExecCreatePartitionPruneState()
> * does, except we don't need a PartitionPruneState here, so don't call
> * that function.
> *
> * XXX some refactoring might be good.
> */
>
> +1, while doing it would be nice if foreach_current_index() is used
> instead of the i & j sequence in the respective foreach() block, IMO.

Actually, I rewrote this part quite significantly so that most of the code remains in its existing place. I decided to let GetLockableRelations_walker() create a PartitionPruneState and pass that to ExecFindInitialMatchingSubPlans() that is now left more or less as is. Instead, ExecCreatePartitionPruneState() is changed to be callable from outside the executor. The temporary EState is no longer necessary. ExprContext, PartitionDirectory, etc. are now managed in the caller, GetLockableRelations_walker().

> --
>
> + while ((i = bms_next_member(validsubplans, i)) >= 0)
> + {
> + Plan *subplan = list_nth(subplans, i);
> +
> + context->relations =
> + bms_add_members(context->relations,
> + get_plan_scanrelids(subplan));
> + }
>
> I think instead of get_plan_scanrelids() the
> GetLockableRelations_worker() can be used; if so, then no need to add
> get_plan_scanrelids() function.

You're right, done.

> --
>
> /* Nodes containing prunable subnodes. */
> + case T_MergeAppend:
> + {
> + PlannedStmt *plannedstmt = context->plannedstmt;
> + List *rtable = plannedstmt->rtable;
> + ParamListInfo params = context->params;
> + PartitionPruneInfo *pruneinfo;
> + Bitmapset *validsubplans;
> + Bitmapset *parentrelids;
> ...
> if (pruneinfo && pruneinfo->contains_init_steps)
> {
> int i;
> ...
> return false;
> }
> }
> break;
>
> Most of the declarations need to be moved inside the if-block.

Done.

> Also, initially, I was a bit concerned regarding this code block
> inside GetLockableRelations_worker(), what if (pruneinfo &&
> pruneinfo->contains_init_steps) evaluated to false? After debugging I
> realized that plan_tree_walker() will do the needful -- a bit of
> comment would have helped.

You're right. Added a dummy else {} block with just the comment saying so.

> + case T_CustomScan:
> + foreach(lc, ((CustomScan *) plan)->custom_plans)
> + {
> + if (walker((Plan *) lfirst(lc), context))
> + return true;
> + }
> + break;
>
> Why not plan_walk_members() call like other nodes?

Makes sense, done.

Again, most/all of this patch might need to be thrown away, but here it is anyway.
-- Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Fri, Jan 14, 2022 at 11:10 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Thu, Jan 6, 2022 at 3:45 PM Amul Sul <sulamul@gmail.com> wrote:
> > Here are few comments for v1 patch:
>
> Thanks Amul. I'm thinking about Robert's latest comments, addressing which may need some rethinking of this whole design, but I decided to post a v2 taking care of your comments.

cfbot tells me there is an unused variable warning, which is fixed in the attached v3.

--
Amit Langote
EDB: http://www.enterprisedb.com
Attachment
On Tue, 11 Jan 2022 at 16:22, Robert Haas <robertmhaas@gmail.com> wrote: > This is just a relatively simple example and I think there are > probably a bunch of others. There are a lot of kinds of DDL that could > be performed on a partition that gets pruned away: DROP INDEX is just > one example. I haven't followed this in any detail, but this patch and its goal of reducing the O(N) drag effect on partition execution time is very important. Locking a long list of objects that then get pruned is very wasteful, as the results show. Ideally, we want an O(1) algorithm for single partition access and DDL is rare. So perhaps that is the starting point for a safe design - invent a single lock or cache that allows us to check if the partition hierarchy has changed in any way, and if so, replan, if not, skip locks. Please excuse me if this idea falls short, if so, please just note my comment about how important this is. Thanks. -- Simon Riggs http://www.EnterpriseDB.com/
Hi Simon, On Tue, Jan 18, 2022 at 4:44 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > On Tue, 11 Jan 2022 at 16:22, Robert Haas <robertmhaas@gmail.com> wrote: > > This is just a relatively simple example and I think there are > > probably a bunch of others. There are a lot of kinds of DDL that could > > be performed on a partition that gets pruned away: DROP INDEX is just > > one example. > > I haven't followed this in any detail, but this patch and its goal of > reducing the O(N) drag effect on partition execution time is very > important. Locking a long list of objects that then get pruned is very > wasteful, as the results show. > > Ideally, we want an O(1) algorithm for single partition access and DDL > is rare. So perhaps that is the starting point for a safe design - > invent a single lock or cache that allows us to check if the partition > hierarchy has changed in any way, and if so, replan, if not, skip > locks. Rearchitecting partition locking to be O(1) seems like a project of non-trivial complexity as Robert mentioned in a related email thread couple of years ago: https://www.postgresql.org/message-id/CA%2BTgmoYbtm1uuDne3rRp_uNA2RFiBwXX1ngj3RSLxOfc3oS7cQ%40mail.gmail.com Pursuing that kind of a project would perhaps have been more worthwhile if the locking issue had affected more than just this particular case, that is, the case of running prepared statements over partitioned tables using generic plans. Addressing this by rearchitecting run-time pruning (and plancache to some degree) seemed like it might lead to this getting fixed in a bounded timeframe. I admit that the concerns that Robert has raised about the patch make me want to reconsider that position, though maybe it's too soon to conclude. -- Amit Langote EDB: http://www.enterprisedb.com
On Tue, 18 Jan 2022 at 08:10, Amit Langote <amitlangote09@gmail.com> wrote: > > Hi Simon, > > On Tue, Jan 18, 2022 at 4:44 PM Simon Riggs > <simon.riggs@enterprisedb.com> wrote: > > On Tue, 11 Jan 2022 at 16:22, Robert Haas <robertmhaas@gmail.com> wrote: > > > This is just a relatively simple example and I think there are > > > probably a bunch of others. There are a lot of kinds of DDL that could > > > be performed on a partition that gets pruned away: DROP INDEX is just > > > one example. > > > > I haven't followed this in any detail, but this patch and its goal of > > reducing the O(N) drag effect on partition execution time is very > > important. Locking a long list of objects that then get pruned is very > > wasteful, as the results show. > > > > Ideally, we want an O(1) algorithm for single partition access and DDL > > is rare. So perhaps that is the starting point for a safe design - > > invent a single lock or cache that allows us to check if the partition > > hierarchy has changed in any way, and if so, replan, if not, skip > > locks. > > Rearchitecting partition locking to be O(1) seems like a project of > non-trivial complexity as Robert mentioned in a related email thread > couple of years ago: > > https://www.postgresql.org/message-id/CA%2BTgmoYbtm1uuDne3rRp_uNA2RFiBwXX1ngj3RSLxOfc3oS7cQ%40mail.gmail.com I agree, completely redesigning locking is a major project. But that isn't what I suggested, which was to find an O(1) algorithm to solve the safety issue. I'm sure there is an easy way to check one lock, maybe a new one/new kind, rather than N. Why does the safety issue exist? Why is it important to be able to concurrently access parts of the hierarchy with DDL? Those are not critical points. If we asked them, most users would trade a 10x performance gain for some restrictions on DDL. If anyone cares, make it an option, but most people will use it. Maybe force all DDL, or just DDL that would cause safety issues, to update a hierarchy version number, so queries can tell whether they need to replan. Don't know, just looking for an O(1) solution. -- Simon Riggs http://www.EnterpriseDB.com/
On Tue, Jan 18, 2022 at 3:10 AM Amit Langote <amitlangote09@gmail.com> wrote: > Pursuing that kind of a project would perhaps have been more > worthwhile if the locking issue had affected more than just this > particular case, that is, the case of running prepared statements over > partitioned tables using generic plans. Addressing this by > rearchitecting run-time pruning (and plancache to some degree) seemed > like it might lead to this getting fixed in a bounded timeframe. I > admit that the concerns that Robert has raised about the patch make me > want to reconsider that position, though maybe it's too soon to > conclude. I wasn't trying to say that your approach was dead in the water. It does create a situation that can't happen today, and such things are scary and need careful thought. But redesigning the locking mechanism would need careful thought, too ... maybe even more of it than sorting this out. I do also agree with Simon that this is an important problem to which we need to find some solution. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 18, 2022 at 7:28 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > On Tue, 18 Jan 2022 at 08:10, Amit Langote <amitlangote09@gmail.com> wrote: > > On Tue, Jan 18, 2022 at 4:44 PM Simon Riggs > > <simon.riggs@enterprisedb.com> wrote: > > > I haven't followed this in any detail, but this patch and its goal of > > > reducing the O(N) drag effect on partition execution time is very > > > important. Locking a long list of objects that then get pruned is very > > > wasteful, as the results show. > > > > > > Ideally, we want an O(1) algorithm for single partition access and DDL > > > is rare. So perhaps that is the starting point for a safe design - > > > invent a single lock or cache that allows us to check if the partition > > > hierarchy has changed in any way, and if so, replan, if not, skip > > > locks. > > > > Rearchitecting partition locking to be O(1) seems like a project of > > non-trivial complexity as Robert mentioned in a related email thread > > couple of years ago: > > > > https://www.postgresql.org/message-id/CA%2BTgmoYbtm1uuDne3rRp_uNA2RFiBwXX1ngj3RSLxOfc3oS7cQ%40mail.gmail.com > > I agree, completely redesigning locking is a major project. But that > isn't what I suggested, which was to find an O(1) algorithm to solve > the safety issue. I'm sure there is an easy way to check one lock, > maybe a new one/new kind, rather than N. I misread your email then, sorry. > Why does the safety issue exist? Why is it important to be able to > concurrently access parts of the hierarchy with DDL? Those are not > critical points. > > If we asked them, most users would trade a 10x performance gain for > some restrictions on DDL. If anyone cares, make it an option, but most > people will use it. > > Maybe force all DDL, or just DDL that would cause safety issues, to > update a hierarchy version number, so queries can tell whether they > need to replan. Don't know, just looking for an O(1) solution. Yeah, it would be great if it would suffice to take a single lock on the partitioned table mentioned in the query, rather than on all elements of the partition tree added to the plan. AFAICS, ways to get that are 1) Prevent modifying non-root partition tree elements, 2) Make it so that locking a partitioned table becomes a proxy for having locked all of its descendents, 3) Invent a Plan representation for scanning partitioned tables such that adding the descendent tables that survive plan-time pruning to the plan doesn't require locking them too. IIUC, you've mentioned 1 and 2. I think I've seen 3 mentioned in the past discussions on this topic, but I guess the research on whether that's doable has never been done. -- Amit Langote EDB: http://www.enterprisedb.com
On Tue, Jan 18, 2022 at 11:53 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jan 18, 2022 at 3:10 AM Amit Langote <amitlangote09@gmail.com> wrote: > > Pursuing that kind of a project would perhaps have been more > > worthwhile if the locking issue had affected more than just this > > particular case, that is, the case of running prepared statements over > > partitioned tables using generic plans. Addressing this by > > rearchitecting run-time pruning (and plancache to some degree) seemed > > like it might lead to this getting fixed in a bounded timeframe. I > > admit that the concerns that Robert has raised about the patch make me > > want to reconsider that position, though maybe it's too soon to > > conclude. > > I wasn't trying to say that your approach was dead in the water. It > does create a situation that can't happen today, and such things are > scary and need careful thought. But redesigning the locking mechanism > would need careful thought, too ... maybe even more of it than sorting > this out. Yes, agreed. -- Amit Langote EDB: http://www.enterprisedb.com
On Wed, 19 Jan 2022 at 08:31, Amit Langote <amitlangote09@gmail.com> wrote: > > Maybe force all DDL, or just DDL that would cause safety issues, to > > update a hierarchy version number, so queries can tell whether they > > need to replan. Don't know, just looking for an O(1) solution. > > Yeah, it would be great if it would suffice to take a single lock on > the partitioned table mentioned in the query, rather than on all > elements of the partition tree added to the plan. AFAICS, ways to get > that are 1) Prevent modifying non-root partition tree elements, Can we reuse the concept of Strong/Weak locking here? When a DDL request is in progress (for that partitioned table), take all required locks for safety. When a DDL request is not in progress, take minimal locks knowing it is safe. We can take a single PartitionTreeModificationLock, nowait to prove that we do not need all locks. DDL would request the lock in exclusive mode. (Other mechanisms possible). -- Simon Riggs http://www.EnterpriseDB.com/
On Thu, Jan 13, 2022 at 3:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 12, 2022 at 9:32 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Or, maybe this won't be a concern if performing ExecutorStart() is made a part of CheckCachedPlan() somehow, which would then take locks on the relation as the PlanState tree is built capturing any plan invalidations, instead of AcquireExecutorLocks(). That does sound like an ambitious undertaking though.
>
> On the surface that would seem to involve abstraction violations, but maybe that could be finessed somehow. The plancache shouldn't know too much about what the executor is going to do with the plan, but it could ask the executor to perform a step that has been designed for use by the plancache. I guess the core problem here is how to pass around information that is node-specific before we've stood up the executor state tree. Maybe the executor could have a function that does the pruning and returns some kind of array of results that can be used both to decide what to lock and also what to consider as pruned at the start of execution. (I'm hand-waving about the details because I don't know.)

The attached patch implements this idea. Sorry for the delay in getting this out and thanks to Robert for the off-list discussions on this.

So the new executor "step" you mention is the function ExecutorPrep in the patch, which calls a recursive function ExecPrepNode on the plan tree's top node, much as ExecutorStart calls (via InitPlan) ExecInitNode to construct a PlanState tree for actual execution paralleling the plan tree. For now, ExecutorPrep() / ExecPrepNode() does mainly two things if and as it walks the plan tree: 1) extract the RT indexes of RTE_RELATION entries and add them to a bitmapset in the result struct, 2) if the node contains a PartitionPruneInfo, perform its "initial pruning steps" and store the result of doing so in a per-plan-node node called PlanPrepOutput.

The bitmapset and the array containing per-plan-node PlanPrepOutput nodes are returned in a node called ExecPrepOutput, which is the result of ExecutorPrep, to its calling module (say, plancache.c), which, after it's done using that information, must pass it forward to subsequent execution steps. That is done by passing it, via the module's callers, to CreateQueryDesc(), which remembers the ExecPrepOutput in the QueryDesc that is eventually passed to ExecutorStart().

A bunch of other details are mentioned in the patch's commit message, which I'm pasting below for anyone reading to spot any obvious flaws (no-go's) of this approach:

    Invent a new executor "prep" phase

    The new phase, implemented by execMain.c:ExecutorPrep() and its recursive underling execProcnode.c:ExecPrepNode(), takes a query's PlannedStmt and processes the plan tree contained in it to produce an ExecPrepOutput node as result. As the plan tree is walked, each node must add the RT index(es) of any relation(s) that it directly manipulates to a bitmapset member of ExecPrepOutput (for example, an IndexScan node must add the Scan's scanrelid). Also, each node may want to make a PlanPrepOutput node containing additional information that may be of interest to the calling module or to the later execution phases, if the node can provide one (for example, an Append node may perform initial pruning and add a set of "initially valid subplans" to the PlanPrepOutput).

    The PlanPrepOutput nodes of all the plan nodes are added to an array in the ExecPrepOutput, which is indexed using the individual nodes' plan_node_id; a NULL is stored in the array slots of nodes that don't have anything interesting to add to the PlanPrepOutput. The ExecPrepOutput thus produced is passed to CreateQueryDesc() and subsequently to ExecutorStart() via QueryDesc, which then makes it available to the executor routines via the query's EState.

    The main goal of adding this new phase is, for now, to allow cached generic plans containing scans of partitioned tables using Append/MergeAppend to be executed more efficiently by the prep phase doing any initial pruning, instead of deferring that to ExecutorStart(). That may allow AcquireExecutorLocks() on the plan to lock only the minimal set of relations/partitions, that is those whose subplans survive the initial pruning.

    Implementation notes:

    * To allow initial pruning to be done as part of the pre-execution prep phase as opposed to as part of ExecutorStart(), this refactors ExecCreatePartitionPruneState() and ExecFindInitialMatchingSubPlans() to pass the information needed to do initial pruning directly as parameters instead of getting that from the EState and the PlanState of the parent Append/MergeAppend, both of which would not be available in ExecutorPrep(). Another, sort of non-essential-to-this-goal, refactoring this does is moving the partition pruning initialization stanzas in ExecInitAppend() and ExecInitMergeAppend(), both of which contain the same code, into its own function ExecInitPartitionPruning().

    * To pass the ExecPrepOutput(s) created by the plancache module's invocation of ExecutorPrep() to the callers of the module, which in turn would pass them down to ExecutorStart(), CachedPlan gets a new List field that stores those ExecPrepOutputs, containing one element for each PlannedStmt also contained in the CachedPlan. The new list is stored in a child context of the context containing the PlannedStmts, though unlike the latter, it is reset on every invocation of CheckCachedPlan(), which in turn calls ExecutorPrep() with a new set of bound Params.

    * AcquireExecutorLocks() is now made to loop over a bitmapset of RT indexes, those of relations returned in ExecPrepOutput, instead of over the whole range table. With initial pruning that is also done as part of ExecutorPrep(), only relations from non-pruned nodes of the plan tree would get locked as a result of this new arrangement.

    * PlannedStmt gets a new field usesPrepExecPruning that indicates whether any of the nodes of the plan tree contain "initial" (or "pre-execution") pruning steps, which saves ExecutorPrep() the trouble of walking the plan tree only to find out whether that's the case.

    * PartitionPruneInfo nodes now explicitly store whether the steps contained in any of the individual PartitionedRelPruneInfos embedded in it contain initial pruning steps (those that can be performed during ExecutorPrep) and execution pruning steps (those that can only be performed during ExecutorRun), as flags contains_initial_steps and contains_exec_steps, respectively. In fact, the aforementioned PlannedStmt field's value is a logical OR of the values of the former across all PartitionPruneInfo nodes embedded in the plan tree.

    * PlannedStmt also gets a bitmapset field to store the RT indexes of all relation RTEs referenced in the query that is populated when constructing the flat range table in setrefs.c, which effectively contains all the relations that the planner must have locked. In the case of a cached plan, AcquireExecutorLocks() must lock all of those relations, except those whose subnodes get pruned as a result of ExecutorPrep().

    * PlannedStmt gets yet another field numPlanNodes that records the highest plan_node_id assigned to any of the nodes contained in the tree, which serves as the size to use when allocating the PlanPrepOutput array.

Maybe this should be more than one patch? Say:

0001 to add ExecutorPrep and the boilerplate,
0002 to teach plancache.c to use the new facility

Thoughts?

--
Amit Langote
EDB: http://www.enterprisedb.com
Attachment
On Thu, Feb 10, 2022 at 3:14 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Maybe this should be more than one patch? Say:
>
> 0001 to add ExecutorPrep and the boilerplate,
> 0002 to teach plancache.c to use the new facility

Could be, not sure. I agree that if it's possible to split this in a meaningful way, it would facilitate review.

I notice that there is some straight code movement e.g. the creation of ExecPartitionPruneFixSubPlanIndexes. It would be best, I think, to do pure code movement in a preparatory patch so that the main patch is just adding the new stuff we need and not moving stuff around.

David Rowley recently proposed a patch for some parallel-safety debugging cross checks which added a plan tree walker. I'm not sure whether he's going to press that patch forward to commit, but I think we should get something like that into the tree and start using it, rather than adding more bespoke code. Maybe you/we should steal that part of his patch and commit it separately. What I'm imagining is that plan_tree_walker() would know which nodes have subnodes and how to recurse over the tree structure, and you'd have a walker function to use with it that would know which executor nodes have ExecPrep functions and call them, and just do nothing for the others. That would spare you adding stub functions for nodes that don't need to do anything, or don't need to do anything other than recurse. Admittedly it would look a bit different from the existing executor phases, but I'd argue that it's a better coding model. Actually, you might've had this in the patch at some point, because you have a declaration for plan_tree_walker but no implementation.

I guess one thing that's a bit awkward about this idea is that in some cases you want to recurse to some subnodes but not other subnodes. But maybe it would work to put the recursion in the walker function in that case, and then just return true; but if you want to walk all children, return false.

+ bool contains_init_steps;
+ bool contains_exec_steps;

s/steps/pruning/? maybe with contains -> needs or performs or requires as well?

+ * Returned information includes the set of RT indexes of relations referenced
+ * in the plan, and a PlanPrepOutput node for each node in the planTree if the
+ * node type supports producing one.

Aren't all RT indexes referenced in the plan?

+ * This may lock relations whose information may be used to produce the
+ * PlanPrepOutput nodes. For example, a partitioned table before perusing its
+ * PartitionPruneInfo contained in an Append node to do the pruning the result
+ * of which is used to populate the Append node's PlanPrepOutput.

"may lock" feels awfully fuzzy to me. How am I supposed to rely on something that "may" happen? And don't we need to have tight logic around locking, with specific guarantees about what is locked at which points in the code and what is not?

+ * At least one of 'planstate' or 'econtext' must be passed to be able to
+ * successfully evaluate any non-Const expressions contained in the
+ * steps.

This also seems fuzzy. If I'm thinking of calling this function, I don't know how I'd know whether this criterion is met.

I don't love PlanPrepOutput the way you have it. I think one of the basic design issues for this patch is: should we think of the prep phase as specifically pruning, or is it general prep and pruning is the first thing for which we're going to use it? If it's really a pre-pruning phase, we could name it that way instead of calling it "prep". If it's really a general prep phase, then why does PlanPrepOutput contain initially_valid_subnodes as a field? One could imagine letting each prep function decide what kind of prep node it would like to return, with partition pruning being just one of the options. But is that a useful generalization of the basic concept, or just pretending that a special-purpose mechanism is more general than it really is?

+ return CreateQueryDesc(pstmt, NULL, /* XXX pass ExecPrepOutput too? */

It seems to me that we should do what the XXX suggests. It doesn't seem nice if the parallel workers could theoretically decide to prune a different set of nodes than the leader.

+ * known at executor startup (excludeing expressions containing

Extra e.

+ * into subplan indexes, is also returned for use during subsquent

Missing e.

Somewhere, we're going to need to document the idea that this may permit us to execute a plan that isn't actually fully valid, but that we expect to survive because we'll never do anything with the parts of it that aren't. Maybe that should be added to the executor README, or maybe there's some better place, but I don't think that should remain something that's just implicit.

This is not a full review, just some initial thoughts looking through this.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2022-02-10 17:13:52 +0900, Amit Langote wrote:
> The attached patch implements this idea. Sorry for the delay in getting this out and thanks to Robert for the off-list discussions on this.

I did not follow this thread at all. And I only skimmed the patch. So I'm probably wrong.

I'm wary of this increasing executor overhead even in cases it won't help. Without this patch, for simple queries, I see small allocations noticeably in profiles. This adds a bunch more, even if !context->stmt->usesPreExecPruning:

- makeNode(ExecPrepContext)
- makeNode(ExecPrepOutput)
- palloc0(sizeof(PlanPrepOutput *) * result->numPlanNodes)
- stmt_execprep_list = lappend(stmt_execprep_list, execprep);
- AllocSetContextCreate(CurrentMemoryContext, "CachedPlan execprep list", ...
- ...

That's a lot of extra for something that's already a bottleneck.

Greetings,

Andres Freund
(just catching up on this thread) On Thu, 13 Jan 2022 at 07:20, Robert Haas <robertmhaas@gmail.com> wrote: > Yeah. I don't think it's only non-core code we need to worry about > either. What if I just do EXPLAIN ANALYZE on a prepared query that > ends up pruning away some stuff? IIRC, the pruned subplans are not > shown, so we might escape disaster here, but FWIW if I'd committed > that code I would have pushed hard for showing those and saying "(not > executed)" .... so it's not too crazy to imagine a world in which > things work that way. FWIW, that would remove the whole point in init run-time pruning. The reason I made two phases of run-time pruning was so that we could get away from having the init plan overhead of nodes we'll never need to scan. If we wanted to show the (never executed) scans in EXPLAIN then we'd need to do the init plan part and allocate all that memory needlessly. Imagine a hash partitioned table on "id" with 1000 partitions. The user does: PREPARE q1 (INT) AS SELECT * FROM parttab WHERE id = $1; EXECUTE q1(123); Assuming a generic plan, if we didn't have init pruning then we have to build a plan containing the scans for all 1000 partitions. There's significant overhead to that compared to just locking the partitions, and initialising 1 scan. If it worked this way then we'd be even further from Amit's goal of reducing the overhead of starting plan with run-time pruning nodes. I understood at the time it was just the EXPLAIN output that you had concerns with. I thought that was just around the lack of any display of the condition we used for pruning. David
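For readers following along, this is roughly how the two-phase scheme looks in ExecInitAppend() today (paraphrased and trimmed from the source, so details may differ slightly across versions): only when the PartitionPruneState says initial pruning is possible is the surviving set computed before any executor state is allocated for the children.

    /* Inside ExecInitAppend(), roughly: */
    if (node->part_prune_info != NULL)
    {
        PartitionPruneState *prunestate;

        prunestate = ExecCreatePartitionPruneState(&appendstate->ps,
                                                   node->part_prune_info);
        appendstate->as_prune_state = prunestate;

        if (prunestate->do_initial_prune)
        {
            /* Evaluate only the steps that need no per-tuple context */
            validsubplans = ExecFindInitialMatchingSubPlans(prunestate,
                                                            list_length(node->appendplans));
            nplans = bms_num_members(validsubplans);
        }
        else
        {
            /* No initial steps; every subplan must be initialized */
            nplans = list_length(node->appendplans);
            validsubplans = bms_add_range(NULL, 0, nplans - 1);
        }
    }

Only the subplans in validsubplans then get ExecInitNode() called on them, which is the startup saving David describes.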
On Sun, Feb 13, 2022 at 4:55 PM David Rowley <dgrowleyml@gmail.com> wrote: > FWIW, that would remove the whole point in init run-time pruning. The > reason I made two phases of run-time pruning was so that we could get > away from having the init plan overhead of nodes we'll never need to > scan. If we wanted to show the (never executed) scans in EXPLAIN then > we'd need to do the init plan part and allocate all that memory > needlessly. Interesting. I didn't realize that was why it had ended up like this. > I understood at the time it was just the EXPLAIN output that you had > concerns with. I thought that was just around the lack of any display > of the condition we used for pruning. That was part of it, but I did think it was surprising that we didn't print anything at all about the nodes we pruned, too. Although we're technically iterating over the PlanState, from the user perspective it feels like you're asking PostgreSQL to print out the plan - so it seems weird to have nodes in the Plan tree that are quietly omitted from the output. That said, perhaps in retrospect it's good that it ended up as it did, since we'd have a lot of trouble printing anything sensible for a scan of a table that's since been dropped. -- Robert Haas EDB: http://www.enterprisedb.com
Hi Andres, On Fri, Feb 11, 2022 at 10:29 AM Andres Freund <andres@anarazel.de> wrote: > On 2022-02-10 17:13:52 +0900, Amit Langote wrote: > > The attached patch implements this idea. Sorry for the delay in > > getting this out and thanks to Robert for the off-list discussions on > > this. > > I did not follow this thread at all. And I only skimmed the patch. So I'm > probably wrong. Thanks for your interest in this and sorry about the delay in replying (have been away due to illness). > I'm a wary of this increasing executor overhead even in cases it won't > help. Without this patch, for simple queries, I see small allocations > noticeably in profiles. This adds a bunch more, even if > !context->stmt->usesPreExecPruning: Ah, if any new stuff added by the patch runs in !context->stmt->usesPreExecPruning paths, then it's just poor coding on my part, which I'm now looking to fix. Maybe not all of it is avoidable, but I think whatever isn't should be trivial... > - makeNode(ExecPrepContext) > - makeNode(ExecPrepOutput) > - palloc0(sizeof(PlanPrepOutput *) * result->numPlanNodes) > - stmt_execprep_list = lappend(stmt_execprep_list, execprep); > - AllocSetContextCreate(CurrentMemoryContext, > "CachedPlan execprep list", ... > - ... > > That's a lot of extra for something that's already a bottleneck. If all these allocations are limited to the usesPreExecPruning path, IMO, they would amount to trivial overhead compared to what is going to be avoided -- locking say 1000 partitions when only 1 will be scanned. Although, maybe there's a way to code this to have even less overhead than what's in the patch now. -- Amit Langote EDB: http://www.enterprisedb.com
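Presumably the fix being described amounts to gating all of the new work behind the plan-level flag, along these lines; usesPreExecPruning and the ExecPrep* names come from the patch, AcquireLocksForCachedPlan() and ExecutorPrepAndLock() are invented for the sketch, and AcquireExecutorLocks() is the existing plancache.c routine.

    /* Hypothetical shape of the fix (sketch only): */
    static void
    AcquireLocksForCachedPlan(PlannedStmt *stmt, List *stmt_list,
                              ParamListInfo params)
    {
        if (!stmt->usesPreExecPruning)
        {
            /* No initial pruning in this plan: old cheap path, no new allocations */
            AcquireExecutorLocks(stmt_list, true);
            return;
        }

        /*
         * Only plans that can actually skip some locks pay for the prep
         * machinery (ExecPrepContext, ExecPrepOutput, the per-node array, ...).
         */
        ExecutorPrepAndLock(stmt, params);  /* made-up name for the new path */
    }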
On Fri, Feb 11, 2022 at 7:02 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Feb 10, 2022 at 3:14 AM Amit Langote <amitlangote09@gmail.com> wrote: > > Maybe this should be more than one patch? Say: > > > > 0001 to add ExecutorPrep and the boilerplate, > > 0002 to teach plancache.c to use the new facility Thanks for taking a look and sorry about the delay. > Could be, not sure. I agree that if it's possible to split this in a > meaningful way, it would facilitate review. I notice that there is > some straight code movement e.g. the creation of > ExecPartitionPruneFixSubPlanIndexes. It would be best, I think, to do > pure code movement in a preparatory patch so that the main patch is > just adding the new stuff we need and not moving stuff around. Okay, created 0001 for moving around the execution pruning code. > David Rowley recently proposed a patch for some parallel-safety > debugging cross checks which added a plan tree walker. I'm not sure > whether he's going to press that patch forward to commit, but I think > we should get something like that into the tree and start using it, > rather than adding more bespoke code. Maybe you/we should steal that > part of his patch and commit it separately. I looked at the thread you mentioned (I guess [1]), though it seems David's proposing a path_tree_walker(), so I guess only useful within the planner and not here. > What I'm imagining is that > plan_tree_walker() would know which nodes have subnodes and how to > recurse over the tree structure, and you'd have a walker function to > use with it that would know which executor nodes have ExecPrep > functions and call them, and just do nothing for the others. That > would spare you adding stub functions for nodes that don't need to do > anything, or don't need to do anything other than recurse. Admittedly > it would look a bit different from the existing executor phases, but > I'd argue that it's a better coding model. > > Actually, you might've had this in the patch at some point, because > you have a declaration for plan_tree_walker but no implementation. Right, the previous patch indeed used a plan_tree_walker() for this and I think in a way you seem to think it should work. I do agree that plan_tree_walker() allows for a better implementation of the idea of this patch and may also be generally useful, so I've created a separate patch that adds it to nodeFuncs.c. > I guess one thing that's a bit awkward about this idea is that in some > cases you want to recurse to some subnodes but not other subnodes. But > maybe it would work to put the recursion in the walker function in > that case, and then just return true; but if you want to walk all > children, return false. Right, that's how I've made ExecPrepAppend() etc. do it. > + bool contains_init_steps; > + bool contains_exec_steps; > > s/steps/pruning/? maybe with contains -> needs or performs or requires as well? Went with: needs_{init|exec}_pruning > + * Returned information includes the set of RT indexes of relations referenced > + * in the plan, and a PlanPrepOutput node for each node in the planTree if the > + * node type supports producing one. > > Aren't all RT indexes referenced in the plan? Ah yes. How about: * Returned information includes the set of RT indexes of relations that must * be locked to safely execute the plan, > + * This may lock relations whose information may be used to produce the > + * PlanPrepOutput nodes. 
For example, a partitioned table before perusing its > + * PartitionPruneInfo contained in an Append node to do the pruning the result > + * of which is used to populate the Append node's PlanPrepOutput. > > "may lock" feels awfully fuzzy to me. How am I supposed to rely on > something that "may" happen? And don't we need to have tight logic > around locking, with specific guarantees about what is locked at which > points in the code and what is not? Agree the wording was fuzzy. I've rewritten it as: * This locks relations whose information is needed to produce the * PlanPrepOutput nodes. For example, a partitioned table before perusing its * PartitionedRelPruneInfo contained in an Append node to do the pruning, the * result of which is used to populate the Append node's PlanPrepOutput. BTW, I've added an Assert in ExecGetRangeTableRelation(): /* * A cross-check that AcquireExecutorLocks() hasn't missed any relations * it must not have. */ Assert(estate->es_execprep == NULL || bms_is_member(rti, estate->es_execprep->relationRTIs)); which IOW ensures that the actual execution of a plan only sees relations that ExecutorPrep() would've told AcquireExecutorLocks() to take a lock on. > + * At least one of 'planstate' or 'econtext' must be passed to be able to > + * successfully evaluate any non-Const expressions contained in the > + * steps. > > This also seems fuzzy. If I'm thinking of calling this function, I > don't know how I'd know whether this criterion is met. OK, I have removed this comment (which was on top of a static local function) in favor of adding some commentary on this in places where it belongs. For example, in ExecPrepDoInitialPruning(): /* * We don't yet have a PlanState for the parent plan node, so must create * a standalone ExprContext to evaluate pruning expressions, equipped with * the information about the EXTERN parameters that the caller passed us. * Note that that's okay because the initial pruning steps does not * involve anything that requires the execution to have started. */ econtext = CreateStandaloneExprContext(); econtext->ecxt_param_list_info = params; prunestate = ExecCreatePartitionPruneState(NULL, pruneinfo, true, false, rtable, econtext, pdir, parentrelids); > I don't love PlanPrepOutput the way you have it. I think one of the > basic design issues for this patch is: should we think of the prep > phase as specifically pruning, or is it general prep and pruning is > the first thing for which we're going to use it? If it's really a > pre-pruning phase, we could name it that way instead of calling it > "prep". If it's really a general prep phase, then why does > PlanPrepOutput contain initially_valid_subnodes as a field? One could > imagine letting each prep function decide what kind of prep node it > would like to return, with partition pruning being just one of the > options. But is that a useful generalization of the basic concept, or > just pretending that a special-purpose mechanism is more general than > it really is? While it can feel like the latter TBH, I'm inclined to keep ExecutorPrep generalized. What bothers me about the alternative of calling the new phase something less generalized like ExecutorDoInitPruning() is that that makes the somewhat elaborate API changes needed for the phase's output to be put into QueryDesc, through which it ultimately reaches the main executor, seem less worthwhile.
I agree that PlanPrepOutput design needs to be likewise generalized, maybe like you suggest -- using PlanInitPruningOutput, a child class of PlanPrepOutput, to return the prep output for plan nodes that support pruning. Thoughts? > + return CreateQueryDesc(pstmt, NULL, /* XXX pass ExecPrepOutput too? */ > > It seems to me that we should do what the XXX suggests. It doesn't > seem nice if the parallel workers could theoretically decide to prune > a different set of nodes than the leader. OK, will fix. > + * known at executor startup (excludeing expressions containing > > Extra e. > > + * into subplan indexes, is also returned for use during subsquent > > Missing e. Will fix. > Somewhere, we're going to need to document the idea that this may > permit us to execute a plan that isn't actually fully valid, but that > we expect to survive because we'll never do anything with the parts of > it that aren't. Maybe that should be added to the executor README, or > maybe there's some better place, but I don't think that should remain > something that's just implicit. Agreed. I'd added a description of the new prep phase to executor README, though the text didn't mention this particular bit. Will fix to mention it. > This is not a full review, just some initial thoughts looking through this. Thanks again. Will post a new version soon after a bit more polishing. -- Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/flat/b59605fecb20ba9ea94e70ab60098c237c870628.camel%40postgrespro.ru
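The "child class" arrangement mentioned here would presumably follow the usual PostgreSQL node convention of embedding the parent struct as the first member, so a PlanInitPruningOutput pointer can be used wherever a PlanPrepOutput pointer is expected; the type names below are the patch's, while the fields are only illustrative.

    typedef struct PlanPrepOutput
    {
        NodeTag     type;
        int         plan_node_id;       /* which plan node this belongs to */
    } PlanPrepOutput;

    typedef struct PlanInitPruningOutput
    {
        PlanPrepOutput base;            /* "parent class" must come first */
        Bitmapset  *initially_valid_subnodes;   /* surviving subplan indexes */
    } PlanInitPruningOutput;

That keeps ExecutorPrep()'s return type generic while letting pruning-capable nodes hand back the extra bitmapset they need.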
On Mon, Mar 7, 2022 at 11:18 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Fri, Feb 11, 2022 at 7:02 AM Robert Haas <robertmhaas@gmail.com> wrote: > > I don't love PlanPrepOutput the way you have it. I think one of the > > basic design issues for this patch is: should we think of the prep > > phase as specifically pruning, or is it general prep and pruning is > > the first thing for which we're going to use it? If it's really a > > pre-pruning phase, we could name it that way instead of calling it > > "prep". If it's really a general prep phase, then why does > > PlanPrepOutput contain initially_valid_subnodes as a field? One could > > imagine letting each prep function decide what kind of prep node it > > would like to return, with partition pruning being just one of the > > options. But is that a useful generalization of the basic concept, or > > just pretending that a special-purpose mechanism is more general than > > it really is? > > While it can feel like the latter TBH, I'm inclined to keep > ExecutorPrep generalized. What bothers me about about the > alternative of calling the new phase something less generalized like > ExecutorDoInitPruning() is that that makes the somewhat elaborate API > changes needed for the phase's output to put into QueryDesc, through > which it ultimately reaches the main executor, seem less worthwhile. > > I agree that PlanPrepOutput design needs to be likewise generalized, > maybe like you suggest -- using PlanInitPruningOutput, a child class > of PlanPrepOutput, to return the prep output for plan nodes that > support pruning. > > Thoughts? So I decided to agree with you after all about limiting the scope of this new executor interface, or IOW call it what it is. I have named it ExecutorGetLockRels() to go with the only use case we know for it -- get the set of relations for AcquireExecutorLocks() to lock to validate a plan tree. Its result returned in a node named ExecLockRelsInfo, which contains the set of relations scanned in the plan tree (lockrels) and a list of PlanInitPruningOutput nodes for all nodes that undergo pruning. > > + return CreateQueryDesc(pstmt, NULL, /* XXX pass ExecPrepOutput too? */ > > > > It seems to me that we should do what the XXX suggests. It doesn't > > seem nice if the parallel workers could theoretically decide to prune > > a different set of nodes than the leader. > > OK, will fix. Done. This required adding nodeToString() and stringToNode() support for the nodes produced by the new executor function that wasn't there before. > > Somewhere, we're going to need to document the idea that this may > > permit us to execute a plan that isn't actually fully valid, but that > > we expect to survive because we'll never do anything with the parts of > > it that aren't. Maybe that should be added to the executor README, or > > maybe there's some better place, but I don't think that should remain > > something that's just implicit. > > Agreed. I'd added a description of the new prep phase to executor > README, though the text didn't mention this particular bit. Will fix > to mention it. Rewrote the comments above ExecutorGetLockRels() (previously ExecutorPrep()) and the executor README text to be explicit about the fact that not locking some relations effectively invalidates pruned parts of the plan tree. > > This is not a full review, just some initial thoughts looking through this. > > Thanks again. Will post a new version soon after a bit more polishing. 
Attached is v5, now broken into 3 patches: 0001: Some refactoring of runtime pruning code 0002: Add a plan_tree_walker 0003: Teach AcquireExecutorLocks to skip locking pruned relations -- Amit Langote EDB: http://www.enterprisedb.com
On Fri, Mar 11, 2022 at 11:35 PM Amit Langote <amitlangote09@gmail.com> wrote: > Attached is v5, now broken into 3 patches: > > 0001: Some refactoring of runtime pruning code > 0002: Add a plan_tree_walker > 0003: Teach AcquireExecutorLocks to skip locking pruned relations Repeated the performance tests described in the 1st email of this thread: HEAD: (copied from the 1st email) 32 tps = 20561.776403 (without initial connection time) 64 tps = 12553.131423 (without initial connection time) 128 tps = 13330.365696 (without initial connection time) 256 tps = 8605.723120 (without initial connection time) 512 tps = 4435.951139 (without initial connection time) 1024 tps = 2346.902973 (without initial connection time) 2048 tps = 1334.680971 (without initial connection time) Patched v1: (copied from the 1st email) 32 tps = 27554.156077 (without initial connection time) 64 tps = 27531.161310 (without initial connection time) 128 tps = 27138.305677 (without initial connection time) 256 tps = 25825.467724 (without initial connection time) 512 tps = 19864.386305 (without initial connection time) 1024 tps = 18742.668944 (without initial connection time) 2048 tps = 16312.412704 (without initial connection time) Patched v5: 32 tps = 28204.197738 (without initial connection time) 64 tps = 26795.385318 (without initial connection time) 128 tps = 26387.920550 (without initial connection time) 256 tps = 25601.141556 (without initial connection time) 512 tps = 19911.947502 (without initial connection time) 1024 tps = 20158.692952 (without initial connection time) 2048 tps = 16180.195463 (without initial connection time) Good to see that these rewrites haven't really hurt the numbers much, which makes sense because the rewrites have really been about putting the code in the right place. BTW, these are the numbers for the same benchmark repeated with plan_cache_mode = auto, which causes a custom plan to be chosen for every execution and so unaffected by this patch. 32 tps = 13359.225082 (without initial connection time) 64 tps = 15760.533280 (without initial connection time) 128 tps = 15825.734482 (without initial connection time) 256 tps = 15017.693905 (without initial connection time) 512 tps = 13479.973395 (without initial connection time) 1024 tps = 13200.444397 (without initial connection time) 2048 tps = 12884.645475 (without initial connection time) Comparing them to numbers when using force_generic_plan shows that making the generic plans faster is indeed worthwhile. -- Amit Langote EDB: http://www.enterprisedb.com
Hi,
w.r.t. v5-0003-Teach-AcquireExecutorLocks-to-skip-locking-pruned.patch :
(pruning steps containing expressions that can be computed before
before the executor proper has started)
the word 'before' was repeated.
For ExecInitParallelPlan():
+ char *execlockrelsinfo_data;
+ char *execlockrelsinfo_space;
the content of execlockrelsinfo_data is copied into execlockrelsinfo_space.
I wonder if having one of execlockrelsinfo_data and execlockrelsinfo_space suffices.
Cheers
On Fri, Mar 11, 2022 at 9:35 AM Amit Langote <amitlangote09@gmail.com> wrote: > Attached is v5, now broken into 3 patches: > > 0001: Some refactoring of runtime pruning code > 0002: Add a plan_tree_walker > 0003: Teach AcquireExecutorLocks to skip locking pruned relations So is any other committer planning to look at this? Tom, perhaps? David? This strikes me as important work, and I don't mind going through and trying to do some detailed review, but (A) I am not the person most familiar with the code being modified here and (B) there are some important theoretical questions about the approach that we might want to try to cover before we get down into the details. In my opinion, the most important theoretical issue here is around reuse of plans that are no longer entirely valid, but the parts that are no longer valid are certain to be pruned. If, because we know that some parameter has some particular value, we skip locking a bunch of partitions, then when we're executing the plan, those partitions need not exist any more -- or they could have different indexes, be detached from the partitioning hierarchy and subsequently altered, whatever. That seems fine to me provided that all of our code (and any third-party code) is careful not to rely on the portion of the plan that we've pruned away, and doesn't assume that (for example) we can still fetch the name of an index whose OID appears in there someplace. I cannot think of a hazard where the fact that the part of a plan is no longer valid because some DDL has been executed "infects" the remainder of the plan. As long as we lock the partitioned tables named in the plan and their descendents down to the level just above the one at which something is pruned, and are careful, I think we should be OK. It would be nice to know if someone has a fundamentally different view of the hazards here, though. Just to state my position here clearly, I would be more than happy if somebody else plans to pick this up and try to get some or all of it committed, and will cheerfully defer to such person in the event that they have that plan. If, however, no such person exists, I may try my hand at that myself. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > In my opinion, the most important theoretical issue here is around > reuse of plans that are no longer entirely valid, but the parts that > are no longer valid are certain to be pruned. If, because we know that > some parameter has some particular value, we skip locking a bunch of > partitions, then when we're executing the plan, those partitions need > not exist any more -- or they could have different indexes, be > detached from the partitioning hierarchy and subsequently altered, > whatever. Check. > That seems fine to me provided that all of our code (and any > third-party code) is careful not to rely on the portion of the plan > that we've pruned away, and doesn't assume that (for example) we can > still fetch the name of an index whose OID appears in there someplace. ... like EXPLAIN, for example? If "pruning" means physical removal from the plan tree, then it's probably all right. However, it looks to me like that doesn't actually happen, or at least doesn't happen till much later, so there's room for worry about a disconnect between what plancache.c has verified and what executor startup will try to touch. As you say, in the absence of any bugs, that's not a problem ... but if there are such bugs, tracking them down would be really hard. What I am skeptical about is that this work actually accomplishes anything under real-world conditions. That's because if pruning would save enough to make skipping the lock-acquisition phase worth the trouble, the plan cache is almost certainly going to decide it should be using a custom plan not a generic plan. Now if we had a better cost model (or, indeed, any model at all) for run-time pruning effects then maybe that situation could be improved. I think we'd be better served to worry about that end of it before we spend more time making the executor even less predictable. Also, while I've not spent much time at all reading this patch, it seems rather desperately undercommented, and a lot of the new names are unintelligible. In particular, I suspect that the patch is significantly redesigning when/where run-time pruning happens (unless it's just letting that be run twice); but I don't see any documentation or name changes suggesting where that responsibility is now. regards, tom lane
On Mon, Mar 14, 2022 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > ... like EXPLAIN, for example? Exactly! I think that's the foremost example, but extension modules like auto_explain or even third-party extensions are also a risk. I think there was some discussion of this previously. > If "pruning" means physical removal from the plan tree, then it's > probably all right. However, it looks to me like that doesn't > actually happen, or at least doesn't happen till much later, so > there's room for worry about a disconnect between what plancache.c > has verified and what executor startup will try to touch. As you > say, in the absence of any bugs, that's not a problem ... but if > there are such bugs, tracking them down would be really hard. Surgery on the plan would violate the general principle that plans are read only once constructed. I think the idea ought to be to pass a secondary data structure around with the plan that defines which parts you must ignore. Any code that fails to use that other data structure in the appropriate manner gets defined to be buggy and has to be fixed by making it follow the new rules. > What I am skeptical about is that this work actually accomplishes > anything under real-world conditions. That's because if pruning would > save enough to make skipping the lock-acquisition phase worth the > trouble, the plan cache is almost certainly going to decide it should > be using a custom plan not a generic plan. Now if we had a better > cost model (or, indeed, any model at all) for run-time pruning effects > then maybe that situation could be improved. I think we'd be better > served to worry about that end of it before we spend more time making > the executor even less predictable. I don't agree with that analysis, because setting plan_cache_mode is not uncommon. Even if that GUC didn't exist, I'm pretty sure there are cases where the planner naturally falls into a generic plan anyway, even though pruning is happening. But as it is, the GUC does exist, and people use it. Consequently, while I'd love to see something done about the costing side of things, I do not accept that all other improvements should wait for that to happen. > Also, while I've not spent much time at all reading this patch, > it seems rather desperately undercommented, and a lot of the > new names are unintelligible. In particular, I suspect that the > patch is significantly redesigning when/where run-time pruning > happens (unless it's just letting that be run twice); but I don't > see any documentation or name changes suggesting where that > responsibility is now. I am sympathetic to that concern. I spent a while staring at a baffling comment in 0001 only to discover it had just been moved from elsewhere. I really don't feel that things in this are as clear as they could be -- although I hasten to add that I respect the people who have done work in this area previously and am grateful for what they did. It's been a huge benefit to the project in spite of the bumps in the road. Moreover, this isn't the only code in PostgreSQL that needs improvement, or the worst. That said, I do think there are problems. I don't yet have a position on whether this patch is making that better or worse. That said, I believe that the core idea of the patch is to optionally perform pruning before we acquire locks or spin up the main executor and then remember the decisions we made. If once the main executor is spun up we already made those decisions, then we must stick with what we decided. 
If not, we make those pruning decisions at the same point we do currently - more or less on demand, at the point when we'd need to know whether to descend that branch of the plan tree or not. I think this scheme comes about because there are a couple of different interfaces to the parameterized query stuff, and in some code paths we have the values early enough to use them for pre-pruning, and in others we don't. -- Robert Haas EDB: http://www.enterprisedb.com
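One possible shape for that secondary structure, purely for illustration (none of these type or field names exist in the tree): the plan itself stays read-only, and a side structure built when the pre-execution pruning is done records which parts of it are still live.

    typedef struct NodePruneResult
    {
        NodeTag     type;
        int         plan_node_id;       /* the Append/MergeAppend this is for */
        Bitmapset  *valid_subplans;     /* indexes into its subplan list */
    } NodePruneResult;

    typedef struct PlanPruneResults
    {
        NodeTag     type;
        Bitmapset  *lockedRelids;       /* RT indexes that were actually locked */
        List       *node_results;       /* one NodePruneResult per pruned node */
    } PlanPruneResults;

EXPLAIN, auto_explain, and any other plan-walking code would then be expected to consult this structure before touching a subnode, which is the "new rule" being described.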
On Tue, Mar 15, 2022 at 5:06 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 14, 2022 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > What I am skeptical about is that this work actually accomplishes > > anything under real-world conditions. That's because if pruning would > > save enough to make skipping the lock-acquisition phase worth the > > trouble, the plan cache is almost certainly going to decide it should > > be using a custom plan not a generic plan. Now if we had a better > > cost model (or, indeed, any model at all) for run-time pruning effects > > then maybe that situation could be improved. I think we'd be better > > served to worry about that end of it before we spend more time making > > the executor even less predictable. > > I don't agree with that analysis, because setting plan_cache_mode is > not uncommon. Even if that GUC didn't exist, I'm pretty sure there are > cases where the planner naturally falls into a generic plan anyway, > even though pruning is happening. But as it is, the GUC does exist, > and people use it. Consequently, while I'd love to see something done > about the costing side of things, I do not accept that all other > improvements should wait for that to happen. I agree that making generic plans execute faster has merit even before we make the costing changes to allow plancache.c prefer generic plans over custom ones in these cases. As the numbers in my previous email show, simply executing a generic plan with the proposed improvements applied is significantly cheaper than having the planner do the pruning on every execution: nparts auto/custom generic ====== ========== ====== 32 13359 28204 64 15760 26795 128 15825 26387 256 15017 25601 512 13479 19911 1024 13200 20158 2048 12884 16180 > > Also, while I've not spent much time at all reading this patch, > > it seems rather desperately undercommented, and a lot of the > > new names are unintelligible. In particular, I suspect that the > > patch is significantly redesigning when/where run-time pruning > > happens (unless it's just letting that be run twice); but I don't > > see any documentation or name changes suggesting where that > > responsibility is now. > > I am sympathetic to that concern. I spent a while staring at a > baffling comment in 0001 only to discover it had just been moved from > elsewhere. I really don't feel that things in this are as clear as > they could be -- although I hasten to add that I respect the people > who have done work in this area previously and am grateful for what > they did. It's been a huge benefit to the project in spite of the > bumps in the road. Moreover, this isn't the only code in PostgreSQL > that needs improvement, or the worst. That said, I do think there are > problems. I don't yet have a position on whether this patch is making > that better or worse. Okay, I'd like to post a new version with the comments edited to make them a bit more intelligible. I understand that the comments around the new invocation mode(s) of runtime pruning are not as clear as they should be, especially as the changes that this patch wants to make to how things work are not very localized. > That said, I believe that the core idea of the patch is to optionally > perform pruning before we acquire locks or spin up the main executor > and then remember the decisions we made. If once the main executor is > spun up we already made those decisions, then we must stick with what > we decided. 
If not, we make those pruning decisions at the same point > we do currently Right. The "initial" pruning, that this patch wants to make occur at an earlier point (plancache.c), is currently performed in ExecInit[Merge]Append(). If it does occur early due to the plan being a cached one, ExecInit[Merge]Append() simply refers to its result that would be made available via a new data structure that plancache.c has been made to pass down to the executor alongside the plan tree. If it does not, ExecInit[Merge]Append() does the pruning in the same way it does now. Such cases include initial pruning using only STABLE expressions that the planner doesn't bother to compute by itself lest the resulting plan may be cached, but no EXTERN parameters. -- Amit Langote EDB: http://www.enterprisedb.com
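In other words, the intended control flow in ExecInitAppend() would look roughly like this; es_execprep is the patch's EState field, while ExecPrepFindValidSubPlans() is a made-up name for looking up the stashed result.

    /* Sketch of the intended flow in ExecInitAppend() */
    if (node->part_prune_info != NULL)
    {
        Bitmapset  *validsubplans;

        if (estate->es_execprep != NULL)
        {
            /*
             * Generic cached plan: plancache.c already ran the initial
             * pruning steps, so just look up the recorded result for this
             * plan node.
             */
            validsubplans = ExecPrepFindValidSubPlans(estate->es_execprep,
                                                      node->plan.plan_node_id);
        }
        else
        {
            /* Custom / uncached plan: do the initial pruning here, as today. */
            PartitionPruneState *prunestate;

            prunestate = ExecCreatePartitionPruneState(&appendstate->ps,
                                                       node->part_prune_info);
            appendstate->as_prune_state = prunestate;
            validsubplans = ExecFindInitialMatchingSubPlans(prunestate,
                                                            list_length(node->appendplans));
        }

        /* ... initialize only the subplans named in validsubplans ... */
    }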
On Tue, Mar 15, 2022 at 3:19 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Tue, Mar 15, 2022 at 5:06 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Mar 14, 2022 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > Also, while I've not spent much time at all reading this patch, > > > it seems rather desperately undercommented, and a lot of the > > > new names are unintelligible. In particular, I suspect that the > > > patch is significantly redesigning when/where run-time pruning > > > happens (unless it's just letting that be run twice); but I don't > > > see any documentation or name changes suggesting where that > > > responsibility is now. > > > > I am sympathetic to that concern. I spent a while staring at a > > baffling comment in 0001 only to discover it had just been moved from > > elsewhere. I really don't feel that things in this are as clear as > > they could be -- although I hasten to add that I respect the people > > who have done work in this area previously and am grateful for what > > they did. It's been a huge benefit to the project in spite of the > > bumps in the road. Moreover, this isn't the only code in PostgreSQL > > that needs improvement, or the worst. That said, I do think there are > > problems. I don't yet have a position on whether this patch is making > > that better or worse. > > Okay, I'd like to post a new version with the comments edited to make > them a bit more intelligible. I understand that the comments around > the new invocation mode(s) of runtime pruning are not as clear as they > should be, especially as the changes that this patch wants to make to > how things work are not very localized. Actually, another area where the comments may not be as clear as they should have been is the changes that the patch makes to the AcquireExecutorLocks() logic that decides which relations are locked to safeguard the plan tree for execution, which are those given by RTE_RELATION entries in the range table. Without the patch, they are found by actually scanning the range table. With the patch, it's the same set of RTEs if the plan doesn't contain any pruning nodes, though instead of the range table, what is scanned is a bitmapset of their RT indexes that is made available by the planner in the form of PlannedStmt.lockrels. When the plan does contain a pruning node (PlannedStmt.containsInitialPruning), the bitmapset is constructed by calling ExecutorGetLockRels() on the plan tree, which walks it to add RT indexes of relations mentioned in the Scan nodes, while skipping any nodes that are pruned after performing initial pruning steps that may be present in their containing parent node's PartitionPruneInfo. Also, the RT indexes of partitioned tables that are present in the PartitionPruneInfo itself are also added to the set. While expanding comments added by the patch to make this clear, I realized that there are two problems, one of them quite glaring: * Planner's constructing this bitmapset and its copying along with the PlannedStmt is pure overhead in the cases that this patch has nothing to do with, which is the kind of thing that Andres cautioned against upthread. * Not all partitioned tables that would have been locked without the patch to come up with a Append/MergeAppend plan may be returned by ExecutorGetLockRels(). For example, if none of the query's runtime-prunable quals were found to match the partition key of an intermediate partitioned table and thus that partitioned table not included in the PartitionPruneInfo. 
Or if an Append/MergeAppend covering a partition tree doesn't contain any PartitionPruneInfo to begin with, in which case, only the leaf partitions and none of partitioned parents would be accounted for by the ExecutorGetLockRels() logic. The 1st one seems easy to fix by not inventing PlannedStmt.lockrels and just doing what's being done now: scan the range table if (!PlannedStmt.containsInitialPruning). The only way perhaps to fix the second one is to reconsider the decision we made in the following commit: commit 52ed730d511b7b1147f2851a7295ef1fb5273776 Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Sun Oct 7 14:33:17 2018 -0400 Remove some unnecessary fields from Plan trees. In the wake of commit f2343653f, we no longer need some fields that were used before to control executor lock acquisitions: * PlannedStmt.nonleafResultRelations can go away entirely. * partitioned_rels can go away from Append, MergeAppend, and ModifyTable. However, ModifyTable still needs to know the RT index of the partition root table if any, which was formerly kept in the first entry of that list. Add a new field "rootRelation" to remember that. rootRelation is partly redundant with nominalRelation, in that if it's set it will have the same value as nominalRelation. However, the latter field has a different purpose so it seems best to keep them distinct. That is, add back the partitioned_rels field, at least to Append and MergeAppend, to store the RT indexes of partitioned tables whose children's paths are present in Append/MergeAppend.subpaths. Thoughts? -- Amit Langote EDB: http://www.enterprisedb.com
On Tue, Mar 22, 2022 at 9:44 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Tue, Mar 15, 2022 at 3:19 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Tue, Mar 15, 2022 at 5:06 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > On Mon, Mar 14, 2022 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Also, while I've not spent much time at all reading this patch, > > > > it seems rather desperately undercommented, and a lot of the > > > > new names are unintelligible. In particular, I suspect that the > > > > patch is significantly redesigning when/where run-time pruning > > > > happens (unless it's just letting that be run twice); but I don't > > > > see any documentation or name changes suggesting where that > > > > responsibility is now. > > > > > > I am sympathetic to that concern. I spent a while staring at a > > > baffling comment in 0001 only to discover it had just been moved from > > > elsewhere. I really don't feel that things in this are as clear as > > > they could be -- although I hasten to add that I respect the people > > > who have done work in this area previously and am grateful for what > > > they did. It's been a huge benefit to the project in spite of the > > > bumps in the road. Moreover, this isn't the only code in PostgreSQL > > > that needs improvement, or the worst. That said, I do think there are > > > problems. I don't yet have a position on whether this patch is making > > > that better or worse. > > > > Okay, I'd like to post a new version with the comments edited to make > > them a bit more intelligible. I understand that the comments around > > the new invocation mode(s) of runtime pruning are not as clear as they > > should be, especially as the changes that this patch wants to make to > > how things work are not very localized. > > Actually, another area where the comments may not be as clear as they > should have been is the changes that the patch makes to the > AcquireExecutorLocks() logic that decides which relations are locked > to safeguard the plan tree for execution, which are those given by > RTE_RELATION entries in the range table. > > Without the patch, they are found by actually scanning the range table. > > With the patch, it's the same set of RTEs if the plan doesn't contain > any pruning nodes, though instead of the range table, what is scanned > is a bitmapset of their RT indexes that is made available by the > planner in the form of PlannedStmt.lockrels. When the plan does > contain a pruning node (PlannedStmt.containsInitialPruning), the > bitmapset is constructed by calling ExecutorGetLockRels() on the plan > tree, which walks it to add RT indexes of relations mentioned in the > Scan nodes, while skipping any nodes that are pruned after performing > initial pruning steps that may be present in their containing parent > node's PartitionPruneInfo. Also, the RT indexes of partitioned tables > that are present in the PartitionPruneInfo itself are also added to > the set. > > While expanding comments added by the patch to make this clear, I > realized that there are two problems, one of them quite glaring: > > * Planner's constructing this bitmapset and its copying along with the > PlannedStmt is pure overhead in the cases that this patch has nothing > to do with, which is the kind of thing that Andres cautioned against > upthread. > > * Not all partitioned tables that would have been locked without the > patch to come up with a Append/MergeAppend plan may be returned by > ExecutorGetLockRels(). 
For example, if none of the query's > runtime-prunable quals were found to match the partition key of an > intermediate partitioned table and thus that partitioned table not > included in the PartitionPruneInfo. Or if an Append/MergeAppend > covering a partition tree doesn't contain any PartitionPruneInfo to > begin with, in which case, only the leaf partitions and none of > partitioned parents would be accounted for by the > ExecutorGetLockRels() logic. > > The 1st one seems easy to fix by not inventing PlannedStmt.lockrels > and just doing what's being done now: scan the range table if > (!PlannedStmt.containsInitialPruning). The attached updated patch does it like this. > The only way perhaps to fix the second one is to reconsider the > decision we made in the following commit: > > commit 52ed730d511b7b1147f2851a7295ef1fb5273776 > Author: Tom Lane <tgl@sss.pgh.pa.us> > Date: Sun Oct 7 14:33:17 2018 -0400 > > Remove some unnecessary fields from Plan trees. > > In the wake of commit f2343653f, we no longer need some fields that > were used before to control executor lock acquisitions: > > * PlannedStmt.nonleafResultRelations can go away entirely. > > * partitioned_rels can go away from Append, MergeAppend, and ModifyTable. > However, ModifyTable still needs to know the RT index of the partition > root table if any, which was formerly kept in the first entry of that > list. Add a new field "rootRelation" to remember that. rootRelation is > partly redundant with nominalRelation, in that if it's set it will have > the same value as nominalRelation. However, the latter field has a > different purpose so it seems best to keep them distinct. > > That is, add back the partitioned_rels field, at least to Append and > MergeAppend, to store the RT indexes of partitioned tables whose > children's paths are present in Append/MergeAppend.subpaths. And implemented this in the attached 0002 that reintroduces partitioned_rels in Append/MergeAppend nodes as a bitmapset of RT indexes. The set contains the RT indexes of partitioned ancestors whose expansion produced the leaf partitions that a given Append/MergeAppend node scans. This project needs this way of knowing the partitioned tables involved in producing an Append/MergeAppend node, because we'd like to give plancache.c the ability to glean the set of relations to be locked by scanning a plan tree to make the tree ready for execution rather than by scanning the range table and the only relations we're missing in the tree right now are partitioned tables. One fly-in-the-ointment situation I faced when doing that is the fact that setrefs.c in most situations removes the Append/MergeAppend from the final plan if it contains only one child subplan. I got around it by inventing a PlannerGlobal/PlannedStmt.elidedAppendPartedRels set which is a union of partitioned_rels of all the Append/MergeAppend nodes in the plan tree that were removed as described. Other than the changes mentioned above, the updated patch now contains a bit more commentary than earlier versions, mostly around AcquireExecutorLocks()'s new way of determining the set of relations to lock and the significantly redesigned working of the "initial" execution pruning. -- Amit Langote EDB: http://www.enterprisedb.com
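To make the intended use of the revived field concrete, the treatment of a single Append inside the lock-collection walk could look roughly like this; partitioned_rels is the field proposed above, get_valid_subplans_after_init_pruning() is a placeholder for running the node's initial pruning steps, and the cast to Scan assumes simple leaf scans for brevity.

    static void
    collect_append_lock_rels(Append *aplan, PlannedStmt *stmt,
                             Bitmapset **lockrels)
    {
        Bitmapset  *validsubplans;
        int         i = -1;

        /* partitioned ancestors are not scanned, but must still be locked */
        *lockrels = bms_add_members(*lockrels, aplan->partitioned_rels);

        /* run the initial pruning steps, if any, to find surviving children */
        validsubplans = get_valid_subplans_after_init_pruning(aplan, stmt);

        while ((i = bms_next_member(validsubplans, i)) >= 0)
        {
            Plan   *subplan = (Plan *) list_nth(aplan->appendplans, i);

            /* assuming simple scans here for brevity */
            *lockrels = bms_add_member(*lockrels, ((Scan *) subplan)->scanrelid);
        }
    }

The RT indexes stashed in the proposed elidedAppendPartedRels field would be added to the set separately, since the Append nodes they came from are no longer in the tree.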
On Mon, Mar 28, 2022 at 4:17 PM Amit Langote <amitlangote09@gmail.com> wrote: > Other than the changes mentioned above, the updated patch now contains > a bit more commentary than earlier versions, mostly around > AcquireExecutorLocks()'s new way of determining the set of relations > to lock and the significantly redesigned working of the "initial" > execution pruning. Forgot to rebase over the latest HEAD, so here's v7. Also fixed that _out and _read functions for PlanInitPruningOutput were using an obsolete node label. -- Amit Langote EDB: http://www.enterprisedb.com
On Mon, Mar 28, 2022 at 4:28 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Mon, Mar 28, 2022 at 4:17 PM Amit Langote <amitlangote09@gmail.com> wrote: > > Other than the changes mentioned above, the updated patch now contains > > a bit more commentary than earlier versions, mostly around > > AcquireExecutorLocks()'s new way of determining the set of relations > > to lock and the significantly redesigned working of the "initial" > > execution pruning. > > Forgot to rebase over the latest HEAD, so here's v7. Also fixed that > _out and _read functions for PlanInitPruningOutput were using an > obsolete node label. Rebased. -- Amit Langote EDB: http://www.enterprisedb.com
I'm looking at 0001 here with intention to commit later. I see that there is some resistance to 0004, but I think a final verdict on that one doesn't materially affect 0001. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El destino baraja y nosotros jugamos" (A. Schopenhauer)
On Thu, Mar 31, 2022 at 6:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I'm looking at 0001 here with intention to commit later. I see that > there is some resistance to 0004, but I think a final verdict on that > one doesn't materially affect 0001. Thanks. While the main goal of the refactoring patch is to make it easier to review the more complex changes that 0004 makes to execPartition.c, I agree it has merit on its own. Although, one may say that the bit about providing a PlanState-independent ExprContext is more closely tied with 0004's requirements... -- Amit Langote EDB: http://www.enterprisedb.com
On Thu, 31 Mar 2022 at 16:25, Amit Langote <amitlangote09@gmail.com> wrote: > Rebased. I've been looking over the v8 patch and I'd like to propose semi-baked ideas to improve things. I'd need to go and write them myself to fully know if they'd actually work ok. 1. You've changed the signature of various functions by adding ExecLockRelsInfo *execlockrelsinfo. I'm wondering why you didn't just put the ExecLockRelsInfo as a new field in PlannedStmt? I think the above gets around messing with the signatures of CreateQueryDesc(), ExplainOnePlan(), pg_plan_queries(), PortalDefineQuery(), ProcessQuery(). It would get rid of your change of foreach to forboth in execute_sql_string() / PortalRunMulti() and gets rid of a number of places where you're carrying around a variable named execlockrelsinfo_list. It would also make the patch significantly easier to review as you'd be touching far fewer files. 2. I don't really like the way you've gone about most of the patch... The way I imagine this working is that during create_plan() we visit all nodes that have run-time pruning, then inside create_append_plan() and create_merge_append_plan() we'd tag those onto a new field in PlannerGlobal. That way you can store the PartitionPruneInfos in the new PlannedStmt field in standard_planner() after the makeNode(PlannedStmt). Instead of storing the PartitionPruneInfo in the Append / MergeAppend struct, you'd just add a new index field to those structs. The index would start with 0 for the 0th PartitionPruneInfo. You'd basically just know the index by assigning list_length(root->glob->partitionpruneinfos). You'd then assign the root->glob->partitionpruneinfos to PlannedStmt.partitionpruneinfos and anytime you needed to do run-time pruning during execution, you'd need to use the Append / MergeAppend's partition_prune_info_idx to look up the PartitionPruneInfo in some new field you add to EState to store those. You'd leave that index as -1 if there's no PartitionPruneInfo for the Append / MergeAppend node. When you do AcquireExecutorLocks(), you'd iterate over the PlannedStmt's PartitionPruneInfos to figure out which subplans to prune. You'd then have an array sized list_length(plannedstmt->runtimepruneinfos) where you'd store the result. When the Append/MergeAppend node starts up you just check if the part_prune_info_idx >= 0 and if there's a non-NULL result stored then use that result. That's how you'd ensure you always got the same run-time prune result between locking and plan startup. 3. Also, looking at ExecGetLockRels(), shouldn't it be the planner's job to determine the minimum set of relations which must be locked? I think the plan tree traversal during execution is not great. Seems the whole point of this patch is to reduce overhead during execution. A full additional plan traversal aside from the 3 that we already do for start/run/end of execution seems not great. I think this means that during AcquireExecutorLocks() you'd start with the minimum set of RTEs that need to be locked as determined during create_plan() and stored in some Bitmapset field in PlannedStmt. This minimal set would also only exclude RTIs that would only possibly be used due to a PartitionPruneInfo with initial pruning steps, i.e. include RTIs from PartitionPruneInfo with no init pruning steps (you can't skip any locks for those). All you need to do to determine the RTEs to lock is to take the minimal set and execute each PartitionPruneInfo in the PlannedStmt that has init steps.
4. It's a bit disappointing to see RelOptInfo.partitioned_rels getting revived here. Why don't you just add a partitioned_relids to PartitionPruneInfo and just have make_partitionedrel_pruneinfo build you a Relids of them? PartitionedRelPruneInfo already has an rtindex field, so you just need to bms_add_member whatever that rtindex is. It's a fairly high-level review at this stage. I can look in more detail if the above points get looked at. You may find or know of some reason why it can't be done like I mention above. David
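A sketch of the data-structure changes proposed in points 1 and 2 above; every name here is either from David's description or invented for the sketch (es_part_prune_results in particular), so none of it is existing code: PlannedStmt gets a List of PartitionPruneInfos plus a minimal-lock Bitmapset, Append/MergeAppend carry only an index into that List, and EState caches one pruning result per PartitionPruneInfo.

    static Bitmapset *
    get_cached_initial_prune_result(EState *estate, int part_prune_info_idx)
    {
        /* -1 means the Append/MergeAppend has no run-time pruning at all */
        if (part_prune_info_idx < 0)
            return NULL;

        /* es_part_prune_results: hypothetical array sized by the PlannedStmt list */
        return estate->es_part_prune_results[part_prune_info_idx];
    }

ExecInitAppend() would then use the cached result when it is non-NULL and fall back to running the pruning steps itself otherwise, which is what guarantees the same answer at lock time and at plan startup.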
Thanks a lot for looking into this. On Fri, Apr 1, 2022 at 10:32 AM David Rowley <dgrowleyml@gmail.com> wrote: > I've been looking over the v8 patch and I'd like to propose semi-baked > ideas to improve things. I'd need to go and write them myself to > fully know if they'd actually work ok. > > 1. You've changed the signature of various functions by adding > ExecLockRelsInfo *execlockrelsinfo. I'm wondering why you didn't just > put the ExecLockRelsInfo as a new field in PlannedStmt? > > I think the above gets around messing the signatures of > CreateQueryDesc(), ExplainOnePlan(), pg_plan_queries(), > PortalDefineQuery(), ProcessQuery() It would get rid of your change of > foreach to forboth in execute_sql_string() / PortalRunMulti() and gets > rid of a number of places where your carrying around a variable named > execlockrelsinfo_list. It would also make the patch significantly > easier to review as you'd be touching far fewer files. I'm worried about that churn myself and did consider this idea, though I couldn't shake the feeling that it's maybe wrong to put something in PlannedStmt that the planner itself doesn't produce. I mean the definition of PlannedStmt says this: /* ---------------- * PlannedStmt node * * The output of the planner With the ideas that you've outlined below, perhaps we can frame most of the things that the patch wants to do as the planner and the plancache changes. If we twist the above definition a bit to say what the plancache does in this regard is part of planning, maybe it makes sense to add the initial pruning related fields (nodes, outputs) into PlannedStmt. > 2. I don't really like the way you've gone about most of the patch... > > The way I imagine this working is that during create_plan() we visit > all nodes that have run-time pruning then inside create_append_plan() > and create_merge_append_plan() we'd tag those onto a new field in > PlannerGlobal That way you can store the PartitionPruneInfos in the > new PlannedStmt field in standard_planner() after the > makeNode(PlannedStmt). > > Instead of storing the PartitionPruneInfo in the Append / MergeAppend > struct, you'd just add a new index field to those structs. The index > would start with 0 for the 0th PartitionPruneInfo. You'd basically > just know the index by assigning > list_length(root->glob->partitionpruneinfos). > > You'd then assign the root->glob->partitionpruneinfos to > PlannedStmt.partitionpruneinfos and anytime you needed to do run-time > pruning during execution, you'd need to use the Append / MergeAppend's > partition_prune_info_idx to lookup the PartitionPruneInfo in some new > field you add to EState to store those. You'd leave that index as -1 > if there's no PartitionPruneInfo for the Append / MergeAppend node. > > When you do AcquireExecutorLocks(), you'd iterate over the > PlannedStmt's PartitionPruneInfo to figure out which subplans to > prune. You'd then have an array sized > list_length(plannedstmt->runtimepruneinfos) where you'd store the > result. When the Append/MergeAppend node starts up you just check if > the part_prune_info_idx >= 0 and if there's a non-NULL result stored > then use that result. That's how you'd ensure you always got the same > run-time prune result between locking and plan startup. Actually, Robert too suggested such an idea to me off-list and I think it's worth trying. 
I was not sure about the implementation, because then we'd be passing around lists of initial pruning nodes/results across many function/module boundaries that you mentioned in your comment 1, but if we agree that PlannedStmt is an acceptable place for those things to be stored, then I agree it's an attractive idea. > 3. Also, looking at ExecGetLockRels(), shouldn't it be the planner's > job to determine the minimum set of relations which must be locked? I > think the plan tree traversal during execution not great. Seems the > whole point of this patch is to reduce overhead during execution. A > full additional plan traversal aside from the 3 that we already do for > start/run/end of execution seems not great. > > I think this means that during AcquireExecutorLocks() you'd start with > the minimum set or RTEs that need to be locked as determined during > create_plan() and stored in some Bitmapset field in PlannedStmt. The patch did have a PlannedStmt.lockrels till v6. Though, it wasn't the same thing as you are describing it... > This > minimal set would also only exclude RTIs that would only possibly be > used due to a PartitionPruneInfo with initial pruning steps, i.e. > include RTIs from PartitionPruneInfo with no init pruining steps (you > can't skip any locks for those). All you need to do to determine the > RTEs to lock are to take the minimal set and execute each > PartitionPruneInfo in the PlannedStmt that has init steps So just thinking about an Append/MergeAppend, the minimum set must include the RT indexes of all the partitioned tables whose direct and indirect children's plans will be in 'subplans' and also of the children if the PartitionPruneInfo doesn't contain initial steps or if there is no PartitionPruneInfo to begin with. One question is whether the planner should always pay the overhead of initializing this bitmapset? I mean it's only worthwhile if AcquireExecutorLocks() is going to be involved, that is, the plan will be cached and reused. > 4. It's a bit disappointing to see RelOptInfo.partitioned_rels getting > revived here. Why don't you just add a partitioned_relids to > PartitionPruneInfo and just have make_partitionedrel_pruneinfo build > you a Relids of them. PartitionedRelPruneInfo already has an rtindex > field, so you just need to bms_add_member whatever that rtindex is. Hmm, not all Append/MergeAppend nodes in the plan tree may have make_partition_pruneinfo() called on them though. If not the proposed RelOptInfo.partitioned_rels that is populated in the early planning stages, the only reliable way to get all the partitioned tables involved in Appends/MergeAppends at create_plan() stage seems to be to make a function out the stanza at the top of make_partition_pruneinfo() that collects them by scanning the leaf paths and tracing each path's relation's parents up to the root partitioned parent and call it from create_{merge_}append_plan() if make_partition_pruneinfo() was not. I did try to implement that and found it a bit complex and expensive (the scanning the leaf paths part). > It's a fairly high-level review at this stage. I can look in more > detail if the above points get looked at. You may find or know of > some reason why it can't be done like I mention above. I'll try to write a version with the above points addressed, while keeping RelOptInfo.partitioned_rels around for now. -- Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/CA%2BHiwqH9-fAvpG-w9qYCcDWzK3vGPCMyw4f9nHzqkxXVuD1pxw%40mail.gmail.com
Amit Langote <amitlangote09@gmail.com> writes: > On Fri, Apr 1, 2022 at 10:32 AM David Rowley <dgrowleyml@gmail.com> wrote: >> 1. You've changed the signature of various functions by adding >> ExecLockRelsInfo *execlockrelsinfo. I'm wondering why you didn't just >> put the ExecLockRelsInfo as a new field in PlannedStmt? > I'm worried about that churn myself and did consider this idea, though > I couldn't shake the feeling that it's maybe wrong to put something in > PlannedStmt that the planner itself doesn't produce. PlannedStmt is part of the plan tree, which MUST be read-only to the executor. This is not negotiable. However, there's other places that this data could be put, such as QueryDesc. Or for that matter, couldn't the data structure be created by the planner? (It looks like David is proposing exactly that further down.) regards, tom lane
On Fri, 1 Apr 2022 at 16:09, Amit Langote <amitlangote09@gmail.com> wrote: > definition of PlannedStmt says this: > > /* ---------------- > * PlannedStmt node > * > * The output of the planner > > With the ideas that you've outlined below, perhaps we can frame most > of the things that the patch wants to do as the planner and the > plancache changes. If we twist the above definition a bit to say what > the plancache does in this regard is part of planning, maybe it makes > sense to add the initial pruning related fields (nodes, outputs) into > PlannedStmt. How about the PartitionPruneInfos go into PlannedStmt as a List indexed in the way I mentioned and the cache of the results of pruning in EState? I think that leaves you adding List *partpruneinfos, Bitmapset *minimumlockrtis to PlannedStmt and the thing you have to cache the pruning results into EState. I'm not very clear on where you should stash the results of run-time pruning in the meantime before you can put them in EState. You might need to invent some intermediate struct that gets passed around that you can scribble down some details you're going to need during execution. > One question is whether the planner should always pay the overhead of > initializing this bitmapset? I mean it's only worthwhile if > AcquireExecutorLocks() is going to be involved, that is, the plan will > be cached and reused. Maybe the Bitmapset for the minimal locks needs to be built with bms_add_range(NULL, 0, list_length(rtable)); then do bms_del_members() on the relevant RTIs you find in the listed PartitionPruneInfos. That way it's very simple and cheap to do when there are no PartitionPruneInfos. > > 4. It's a bit disappointing to see RelOptInfo.partitioned_rels getting > > revived here. Why don't you just add a partitioned_relids to > > PartitionPruneInfo and just have make_partitionedrel_pruneinfo build > > you a Relids of them. PartitionedRelPruneInfo already has an rtindex > > field, so you just need to bms_add_member whatever that rtindex is. > > Hmm, not all Append/MergeAppend nodes in the plan tree may have > make_partition_pruneinfo() called on them though. For Append/MergeAppends without run-time pruning you'll want to add the RTIs to the minimal locking set of RTIs to go into PlannedStmt. The only things you want to leave out of that are RTIs for the RTEs that you might run-time prune away during AcquireExecutorLocks(). David
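To make the bitmapset suggestion concrete, here is a minimal sketch of how the minimal lock set might be built at the end of planning (starting from 1 since RT indexes are 1-based); needs_init_pruning and prunable_leaf_relids are hypothetical fields used only for illustration:

    ListCell   *lc;

    /* Start by assuming every range table entry must be locked. */
    glob->minLockRelids = bms_add_range(NULL, 1, list_length(glob->finalrtable));

    /* Knock out only the leaf partitions that initial pruning might skip. */
    foreach(lc, glob->partPruneInfos)
    {
        PartitionPruneInfo *pruneinfo = lfirst_node(PartitionPruneInfo, lc);

        /* No initial pruning steps -> no locks can be skipped for this node. */
        if (!pruneinfo->needs_init_pruning)     /* hypothetical flag */
            continue;

        /* prunable_leaf_relids: hypothetically collected by make_partition_pruneinfo() */
        glob->minLockRelids = bms_del_members(glob->minLockRelids,
                                              pruneinfo->prunable_leaf_relids);
    }

Built this way, a plan with no initial pruning steps pays little more than the single bms_add_range() call, which is what keeps the common no-pruning case cheap.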
On Fri, Apr 1, 2022 at 1:08 PM David Rowley <dgrowleyml@gmail.com> wrote: > On Fri, 1 Apr 2022 at 16:09, Amit Langote <amitlangote09@gmail.com> wrote: > > definition of PlannedStmt says this: > > > > /* ---------------- > > * PlannedStmt node > > * > > * The output of the planner > > > > With the ideas that you've outlined below, perhaps we can frame most > > of the things that the patch wants to do as the planner and the > > plancache changes. If we twist the above definition a bit to say what > > the plancache does in this regard is part of planning, maybe it makes > > sense to add the initial pruning related fields (nodes, outputs) into > > PlannedStmt. > > How about the PartitionPruneInfos go into PlannedStmt as a List > indexed in the way I mentioned and the cache of the results of pruning > in EState? > > I think that leaves you adding List *partpruneinfos, Bitmapset > *minimumlockrtis to PlannedStmt and the thing you have to cache the > pruning results into EState. I'm not very clear on where you should > stash the results of run-time pruning in the meantime before you can > put them in EState. You might need to invent some intermediate struct > that gets passed around that you can scribble down some details you're > going to need during execution. Yes, the ExecLockRelsInfo node in the current patch, that first gets added to the QueryDesc and subsequently to the EState of the query, serves as that stashing place. Not sure if you've looked at ExecLockRelInfo in detail in your review of the patch so far, but it carries the initial pruning result in what are called PlanInitPruningOutput nodes, which are stored in a list in ExecLockRelsInfo and their offsets in the list are in turn stored in an adjacent array that contains an element for every plan node in the tree. If we go with a PlannedStmt.partpruneinfos list, then maybe we don't need to have that array, because the Append/MergeAppend nodes would be carrying those offsets by themselves. Maybe a different name for ExecLockRelsInfo would be better? Also, given Tom's apparent dislike for carrying that in PlannedStmt, maybe the way I have it now is fine? > > One question is whether the planner should always pay the overhead of > > initializing this bitmapset? I mean it's only worthwhile if > > AcquireExecutorLocks() is going to be involved, that is, the plan will > > be cached and reused. > > Maybe the Bitmapset for the minimal locks needs to be built with > bms_add_range(NULL, 0, list_length(rtable)); then do > bms_del_members() on the relevant RTIs you find in the listed > PartitionPruneInfos. That way it's very simple and cheap to do when > there are no PartitionPruneInfos. Ah, okay. Looking at make_partition_pruneinfo(), I think I see a way to delete the RTIs of prunable relations -- construct a all_matched_leaf_part_relids in parallel to allmatchedsubplans and delete those from the initial set. > > > 4. It's a bit disappointing to see RelOptInfo.partitioned_rels getting > > > revived here. Why don't you just add a partitioned_relids to > > > PartitionPruneInfo and just have make_partitionedrel_pruneinfo build > > > you a Relids of them. PartitionedRelPruneInfo already has an rtindex > > > field, so you just need to bms_add_member whatever that rtindex is. > > > > Hmm, not all Append/MergeAppend nodes in the plan tree may have > > make_partition_pruneinfo() called on them though. > > For Append/MergeAppends without run-time pruning you'll want to add > the RTIs to the minimal locking set of RTIs to go into PlannedStmt. 
> The only things you want to leave out of that are RTIs for the RTEs > that you might run-time prune away during AcquireExecutorLocks(). Yeah, I see it now. Thanks. -- Amit Langote EDB: http://www.enterprisedb.com
On Fri, Apr 1, 2022 at 12:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <amitlangote09@gmail.com> writes: > > On Fri, Apr 1, 2022 at 10:32 AM David Rowley <dgrowleyml@gmail.com> wrote: > >> 1. You've changed the signature of various functions by adding > >> ExecLockRelsInfo *execlockrelsinfo. I'm wondering why you didn't just > >> put the ExecLockRelsInfo as a new field in PlannedStmt? > > > I'm worried about that churn myself and did consider this idea, though > > I couldn't shake the feeling that it's maybe wrong to put something in > > PlannedStmt that the planner itself doesn't produce. > > PlannedStmt is part of the plan tree, which MUST be read-only to > the executor. This is not negotiable. However, there's other > places that this data could be put, such as QueryDesc. > Or for that matter, couldn't the data structure be created by > the planner? (It looks like David is proposing exactly that > further down.) The data structure in question is for storing the results of performing initial partition pruning on a generic plan, which the patch proposes to do in plancache.c -- inside the body of AcquireExecutorLocks()'s loop over PlannedStmts -- so, it's hard to see it as a product of the planner. :-( -- Amit Langote EDB: http://www.enterprisedb.com
On Fri, 1 Apr 2022 at 19:58, Amit Langote <amitlangote09@gmail.com> wrote: > Yes, the ExecLockRelsInfo node in the current patch, that first gets > added to the QueryDesc and subsequently to the EState of the query, > serves as that stashing place. Not sure if you've looked at > ExecLockRelInfo in detail in your review of the patch so far, but it > carries the initial pruning result in what are called > PlanInitPruningOutput nodes, which are stored in a list in > ExecLockRelsInfo and their offsets in the list are in turn stored in > an adjacent array that contains an element for every plan node in the > tree. If we go with a PlannedStmt.partpruneinfos list, then maybe we > don't need to have that array, because the Append/MergeAppend nodes > would be carrying those offsets by themselves. I saw it, just not in great detail. I saw that you had an array that was indexed by the plan node's ID. I thought that wouldn't be so good with large complex plans that we often get with partitioning workloads. That's why I mentioned using another index that you store in Append/MergeAppend that starts at 0 and increments by 1 for each node that has a PartitionPruneInfo made for it during create_plan. > Maybe a different name for ExecLockRelsInfo would be better? > > Also, given Tom's apparent dislike for carrying that in PlannedStmt, > maybe the way I have it now is fine? I think if you change how it's indexed and the other stuff then we can have another look. I think the patch will be much easier to review once the ParitionPruneInfos are moved into PlannedStmt. David
On Fri, Apr 1, 2022 at 5:20 PM David Rowley <dgrowleyml@gmail.com> wrote: > On Fri, 1 Apr 2022 at 19:58, Amit Langote <amitlangote09@gmail.com> wrote: > > Yes, the ExecLockRelsInfo node in the current patch, that first gets > > added to the QueryDesc and subsequently to the EState of the query, > > serves as that stashing place. Not sure if you've looked at > > ExecLockRelInfo in detail in your review of the patch so far, but it > > carries the initial pruning result in what are called > > PlanInitPruningOutput nodes, which are stored in a list in > > ExecLockRelsInfo and their offsets in the list are in turn stored in > > an adjacent array that contains an element for every plan node in the > > tree. If we go with a PlannedStmt.partpruneinfos list, then maybe we > > don't need to have that array, because the Append/MergeAppend nodes > > would be carrying those offsets by themselves. > > I saw it, just not in great detail. I saw that you had an array that > was indexed by the plan node's ID. I thought that wouldn't be so good > with large complex plans that we often get with partitioning > workloads. That's why I mentioned using another index that you store > in Append/MergeAppend that starts at 0 and increments by 1 for each > node that has a PartitionPruneInfo made for it during create_plan. > > > Maybe a different name for ExecLockRelsInfo would be better? > > > > Also, given Tom's apparent dislike for carrying that in PlannedStmt, > > maybe the way I have it now is fine? > > I think if you change how it's indexed and the other stuff then we can > have another look. I think the patch will be much easier to review > once the ParitionPruneInfos are moved into PlannedStmt. Will do, thanks. -- Amit Langote EDB: http://www.enterprisedb.com
I noticed a definitional problem in 0001 that's also a bug in some conditions -- namely that the bitmapset "validplans" is never explicitly initialized to NIL. In the original coding, the BMS was always returned from somewhere; in the new code, it is passed from an uninitialized stack variable into the new ExecInitPartitionPruning function, which then proceeds to add new members to it without initializing it first. Indeed that function's header comment explicitly indicates that it is not initialized: + * Initial pruning can be done immediately, so it is done here if needed and + * the set of surviving partition subplans' indexes are added to the output + * parameter *initially_valid_subplans. even though this is not fully correct, because when prunestate->do_initial_prune is false, then the BMS *is* initialized. I have no opinion on where to initialize it, but it needs to be done somewhere and the comment needs to agree. I think the names ExecCreatePartitionPruneState and ExecInitPartitionPruning are too confusingly similar. Maybe the former should be renamed to somehow make it clear that it is a subroutine for the latter. At the top of the file, there's a new comment that reads: * ExecInitPartitionPruning: * Creates the PartitionPruneState required by each of the two pruning * functions. What are "the two pruning functions"? I think here you mean "Append" and "MergeAppend". Maybe spell that out explicitly. I think this comment needs to be reworded: + * Subplans would previously be indexed 0..(n_total_subplans - 1) should be + * changed to index range 0..num(initially_valid_subplans). -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Thanks for the review. On Sun, Apr 3, 2022 at 8:33 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I noticed a definitional problem in 0001 that's also a bug in some > conditions -- namely that the bitmapset "validplans" is never explicitly > initialized to NIL. In the original coding, the BMS was always returned > from somewhere; in the new code, it is passed from an uninitialized > stack variable into the new ExecInitPartitionPruning function, which > then proceeds to add new members to it without initializing it first. Hmm, the following blocks in ExecInitPartitionPruning() define *initially_valid_subplans: /* * Perform an initial partition prune pass, if required. */ if (prunestate->do_initial_prune) { /* Determine which subplans survive initial pruning */ *initially_valid_subplans = ExecFindInitialMatchingSubPlans(prunestate); } else { /* We'll need to initialize all subplans */ Assert(n_total_subplans > 0); *initially_valid_subplans = bms_add_range(NULL, 0, n_total_subplans - 1); } AFAICS, both assign *initially_valid_subplans a value whose computation is not dependent on reading it first, so I don't see a problem. Am I missing something? > Indeed that function's header comment explicitly indicates that it is > not initialized: > > + * Initial pruning can be done immediately, so it is done here if needed and > + * the set of surviving partition subplans' indexes are added to the output > + * parameter *initially_valid_subplans. > > even though this is not fully correct, because when prunestate->do_initial_prune > is false, then the BMS *is* initialized. > > I have no opinion on where to initialize it, but it needs to be done > somewhere and the comment needs to agree. I can see that the comment is insufficient, so I've expanded it as follows: - * Initial pruning can be done immediately, so it is done here if needed and - * the set of surviving partition subplans' indexes are added to the output - * parameter *initially_valid_subplans. + * On return, *initially_valid_subplans is assigned the set of indexes of + * child subplans that must be initialized along with the parent plan node. + * Initial pruning is performed here if needed and in that case only the + * surviving subplans' indexes are added. > I think the names ExecCreatePartitionPruneState and > ExecInitPartitionPruning are too confusingly similar. Maybe the former > should be renamed to somehow make it clear that it is a subroutine for > the former. Ah, yes. I've taken out the "Exec" from the former. > At the top of the file, there's a new comment that reads: > > * ExecInitPartitionPruning: > * Creates the PartitionPruneState required by each of the two pruning > * functions. > > What are "the two pruning functions"? I think here you mean "Append" > and "MergeAppend". Maybe spell that out explicitly. Actually it meant: ExecFindInitiaMatchingSubPlans() and ExecFindMatchingSubPlans(). They perform "initial" and "exec" set of pruning steps, respectively. I realized that both functions have identical bodies at this point, except that they pass 'true' and 'false', respectively, for initial_prune argument of the sub-routine find_matching_subplans_recurse(), which is where the pruning using the appropriate set of steps contained in PartitionPruneState (initial_pruning_steps or exec_pruning_steps) actually occurs. So, I've updated the patch to just retain the latter, adding an initial_prune parameter to it to pass to the aforementioned find_matching_subplans_recurse(). 
I've also updated the run-time pruning module comment to describe this change: * ExecFindMatchingSubPlans: - * Returns indexes of matching subplans after evaluating all available - * expressions, that is, using execution pruning steps. This function can - * can only be called during execution and must be called again each time - * the value of a Param listed in PartitionPruneState's 'execparamids' - * changes. + * Returns indexes of matching subplans after evaluating the expressions + * that are safe to evaluate at a given point. This function is first + * called during ExecInitPartitionPruning() to find the initially + * matching subplans based on performing the initial pruning steps and + * then must be called again each time the value of a Param listed in + * PartitionPruneState's 'execparamids' changes. > I think this comment needs to be reworded: > > + * Subplans would previously be indexed 0..(n_total_subplans - 1) should be > + * changed to index range 0..num(initially_valid_subplans). Assuming you meant to ask to write this without the odd notation, I've expanded the comment as follows: - * Subplans would previously be indexed 0..(n_total_subplans - 1) should be - * changed to index range 0..num(initially_valid_subplans). + * Current values of the indexes present in PartitionPruneState count all the + * subplans that would be present before initial pruning was done. If initial + * pruning got rid of some of the subplans, any subsequent pruning passes will + * will be looking at a different set of target subplans to choose from than + * those in the pre-initial-pruning set, so the maps in PartitionPruneState + * containing those indexes must be updated to reflect the new indexes of + * subplans in the post-initial-pruning set. I've attached only the updated 0001, though I'm still working on the others to address David's comments. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
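A simplified sketch of what the merged entry point could end up looking like after folding ExecFindInitialMatchingSubPlans() into ExecFindMatchingSubPlans(); memory-context handling and other details the real function would need are omitted:

    Bitmapset *
    ExecFindMatchingSubPlans(PartitionPruneState *prunestate,
                             bool initial_prune)
    {
        Bitmapset  *result = NULL;
        int         i;

        /* Only legal to come here for exec pruning if there are exec steps. */
        Assert(initial_prune || prunestate->do_exec_prune);

        for (i = 0; i < prunestate->num_partprunedata; i++)
        {
            PartitionPruningData *prunedata = prunestate->partprunedata[i];

            /*
             * Recurse from the topmost partitioned table of this hierarchy;
             * initial_prune selects which set of pruning steps is evaluated.
             */
            find_matching_subplans_recurse(prunedata,
                                           &prunedata->partrelprunedata[0],
                                           initial_prune, &result);
        }

        return result;
    }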
On Mon, Apr 4, 2022 at 9:55 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Sun, Apr 3, 2022 at 8:33 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I think the names ExecCreatePartitionPruneState and > > ExecInitPartitionPruning are too confusingly similar. Maybe the former > > should be renamed to somehow make it clear that it is a subroutine for > > the former. > > Ah, yes. I've taken out the "Exec" from the former. While at it, maybe it's better to rename ExecInitPruningContext() to InitPartitionPruneContext(), which I've done in the attached updated patch. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
On 2022-Apr-05, Amit Langote wrote: > While at it, maybe it's better to rename ExecInitPruningContext() to > InitPartitionPruneContext(), which I've done in the attached updated > patch. Good call. I had changed that name too, but yours seems a better choice. I made a few other cosmetic changes and pushed. I'm afraid this will cause a few conflicts with your 0004 -- hopefully these should mostly be minor. One change that's not completely cosmetic is a change in the test on whether to call PartitionPruneFixSubPlanMap or not. Originally it was: if (partprune->do_exec_prune && bms_num_members( ... )) do_stuff(); which meant that bms_num_members() is only evaluated if do_exec_prune. However, the do_exec_prune bit is an optimization (we can skip doing that stuff if it's not going to be used), but the other test is more strict: the stuff is completely irrelevant if no plans have been removed, since the data structure does not need fixing. So I changed it to be like this if (bms_num_members( .. )) { /* can skip if it's pointless */ if (do_exec_prune) do_stuff(); } I think that it is clearer to the human reader this way; and I think a smart compiler may realize that the test can be reversed and avoid counting bits when it's pointless. So your 0004 patch should add the new condition to the outer if(), since it's a critical consideration rather than an optimization: if (partprune && bms_num_members()) { /* can skip if pointless */ if (do_exec_prune) do_stuff() } Now, if we disagree and think that counting bits in the BMS when it's going to be discarded by do_exec_prune being false, then we can flip that back as originally and a more explicit comment. With no evidence, I doubt it matters. Thanks for the patch! I think the new coding is indeed a bit easier to follow. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ <inflex> really, I see PHP as like a strange amalgamation of C, Perl, Shell <crab> inflex: you know that "amalgam" means "mixture with mercury", more or less, right? <crab> i.e., "deadly poison"
On Tue, Apr 5, 2022 at 7:00 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > On 2022-Apr-05, Amit Langote wrote: > > While at it, maybe it's better to rename ExecInitPruningContext() to > > InitPartitionPruneContext(), which I've done in the attached updated > > patch. > > Good call. I had changed that name too, but yours seems a better > choice. > > I made a few other cosmetic changes and pushed. Thanks! > I'm afraid this will > cause a few conflicts with your 0004 -- hopefully these should mostly be > minor. > > One change that's not completely cosmetic is a change in the test on > whether to call PartitionPruneFixSubPlanMap or not. Originally it was: > > if (partprune->do_exec_prune && > bms_num_members( ... )) > do_stuff(); > > which meant that bms_num_members() is only evaluated if do_exec_prune. > However, the do_exec_prune bit is an optimization (we can skip doing > that stuff if it's not going to be used), but the other test is more > strict: the stuff is completely irrelevant if no plans have been > removed, since the data structure does not need fixing. So I changed it > to be like this > > if (bms_num_members( .. )) > { > /* can skip if it's pointless */ > if (do_exec_prune) > do_stuff(); > } > > I think that it is clearer to the human reader this way; and I think a > smart compiler may realize that the test can be reversed and avoid > counting bits when it's pointless. > > So your 0004 patch should add the new condition to the outer if(), since > it's a critical consideration rather than an optimization: > if (partprune && bms_num_members()) > { > /* can skip if pointless */ > if (do_exec_prune) > do_stuff() > } > > Now, if we disagree and think that counting bits in the BMS when it's > going to be discarded by do_exec_prune being false, then we can flip > that back as originally and a more explicit comment. With no evidence, > I doubt it matters. I agree that counting bits in the outer condition makes this easier to read, so see no problem with keeping it that way. Will post the rebased main patch soon, whose rewrite I'm close to being done with. -- Amit Langote EDB: http://www.enterprisedb.com
On Fri, Apr 1, 2022 at 5:36 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Fri, Apr 1, 2022 at 5:20 PM David Rowley <dgrowleyml@gmail.com> wrote: > > On Fri, 1 Apr 2022 at 19:58, Amit Langote <amitlangote09@gmail.com> wrote: > > > Yes, the ExecLockRelsInfo node in the current patch, that first gets > > > added to the QueryDesc and subsequently to the EState of the query, > > > serves as that stashing place. Not sure if you've looked at > > > ExecLockRelInfo in detail in your review of the patch so far, but it > > > carries the initial pruning result in what are called > > > PlanInitPruningOutput nodes, which are stored in a list in > > > ExecLockRelsInfo and their offsets in the list are in turn stored in > > > an adjacent array that contains an element for every plan node in the > > > tree. If we go with a PlannedStmt.partpruneinfos list, then maybe we > > > don't need to have that array, because the Append/MergeAppend nodes > > > would be carrying those offsets by themselves. > > > > I saw it, just not in great detail. I saw that you had an array that > > was indexed by the plan node's ID. I thought that wouldn't be so good > > with large complex plans that we often get with partitioning > > workloads. That's why I mentioned using another index that you store > > in Append/MergeAppend that starts at 0 and increments by 1 for each > > node that has a PartitionPruneInfo made for it during create_plan. > > > > > Maybe a different name for ExecLockRelsInfo would be better? > > > > > > Also, given Tom's apparent dislike for carrying that in PlannedStmt, > > > maybe the way I have it now is fine? > > > > I think if you change how it's indexed and the other stuff then we can > > have another look. I think the patch will be much easier to review > > once the ParitionPruneInfos are moved into PlannedStmt. > > Will do, thanks. And here is a version like that that passes make check-world. Maybe still a WIP as I think comments could use more editing. Here's how the new implementation works: AcquireExecutorLocks() calls ExecutorDoInitialPruning(), which in turn iterates over a list of PartitionPruneInfos in a given PlannedStmt coming from a CachedPlan. For each PartitionPruneInfo, ExecPartitionDoInitialPruning() is called, which sets up PartitionPruneState and performs initial pruning steps present in the PartitionPruneInfo. The resulting bitmapsets of valid subplans, one for each PartitionPruneInfo, are collected in a list and added to a result node called PartitionPruneResult. It represents the result of performing initial pruning on all PartitionPruneInfos found in a plan. A list of PartitionPruneResults is passed along with the PlannedStmt to the executor, which is referenced when initializing Append/MergeAppend nodes. PlannedStmt.minLockRelids defined by the planner contains the RT indexes of all the entries in the range table minus those of the leaf partitions whose subplans are subject to removal due to initial pruning. AcquireExecutoLocks() adds back the RT indexes of only those leaf partitions whose subplans survive ExecutorDoInitialPruning(). To get the leaf partition RT indexes from the PartitionPruneInfo, a new rti_map array is added to PartitionedRelPruneInfo. There's only one patch this time. Patches that added partitioned_rels and plan_tree_walker() are no longer necessary. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
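As a reading aid, roughly what the AcquireExecutorLocks() side of that flow might look like per PlannedStmt; the loop shape and names such as survived_leaf_relids are guesses for illustration, not the patch's exact code, and declarations of the surrounding variables are omitted:

    PlannedStmt *plannedstmt = lfirst_node(PlannedStmt, lc);
    PartitionPruneResult *pruneresult;
    Bitmapset  *lockrelids;
    int         rti;

    /* Perform the initial pruning steps of every PartitionPruneInfo. */
    pruneresult = ExecutorDoInitialPruning(plannedstmt);

    /*
     * Lock the planner-computed minimal set plus the leaf partitions whose
     * subplans survived initial pruning (survived_leaf_relids is a
     * hypothetical field; the patch derives these RTIs via rti_map).
     */
    lockrelids = bms_union(plannedstmt->minLockRelids,
                           pruneresult->survived_leaf_relids);

    rti = -1;
    while ((rti = bms_next_member(lockrelids, rti)) >= 0)
    {
        RangeTblEntry *rte = rt_fetch(rti, plannedstmt->rtable);

        if (rte->rtekind == RTE_RELATION)
            LockRelationOid(rte->relid, rte->rellockmode);
    }

    /* Remember the pruning results to hand to the executor later. */
    part_prune_results = lappend(part_prune_results, pruneresult);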
On Wed, Apr 6, 2022 at 4:20 PM Amit Langote <amitlangote09@gmail.com> wrote: > And here is a version like that that passes make check-world. Maybe > still a WIP as I think comments could use more editing. > > Here's how the new implementation works: > > AcquireExecutorLocks() calls ExecutorDoInitialPruning(), which in turn > iterates over a list of PartitionPruneInfos in a given PlannedStmt > coming from a CachedPlan. For each PartitionPruneInfo, > ExecPartitionDoInitialPruning() is called, which sets up > PartitionPruneState and performs initial pruning steps present in the > PartitionPruneInfo. The resulting bitmapsets of valid subplans, one > for each PartitionPruneInfo, are collected in a list and added to a > result node called PartitionPruneResult. It represents the result of > performing initial pruning on all PartitionPruneInfos found in a plan. > A list of PartitionPruneResults is passed along with the PlannedStmt > to the executor, which is referenced when initializing > Append/MergeAppend nodes. > > PlannedStmt.minLockRelids defined by the planner contains the RT > indexes of all the entries in the range table minus those of the leaf > partitions whose subplans are subject to removal due to initial > pruning. AcquireExecutoLocks() adds back the RT indexes of only those > leaf partitions whose subplans survive ExecutorDoInitialPruning(). To > get the leaf partition RT indexes from the PartitionPruneInfo, a new > rti_map array is added to PartitionedRelPruneInfo. > > There's only one patch this time. Patches that added partitioned_rels > and plan_tree_walker() are no longer necessary. Here's an updated version. In particular, I removed the part_prune_results list from PortalData, in favor of having anything that needs to look at the list instead get it from the CachedPlan (PortalData.cplan). This makes things better in 2 ways: * All the changes that were needed to produce the list to be passed to PortalDefineQuery() are now unnecessary (especially ugly ones were those made to pg_plan_queries()'s interface) * The cases in which the PartitionPruneResult being added to a QueryDesc can be assumed to be valid are more clearly defined now; it's the cases where the portal's CachedPlan is also valid, that is, if the accompanying PlannedStmt is a cached one. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Thu, 7 Apr 2022 at 20:28, Amit Langote <amitlangote09@gmail.com> wrote: > Here's an updated version. In Particular, I removed > part_prune_results list from PortalData, in favor of anything that > needs to look at the list can instead get it from the CachedPlan > (PortalData.cplan). This makes things better in 2 ways: Thanks for making those changes. I'm not overly familiar with the data structures we use for planning around plans between the planner and executor, but storing the pruning results in CachedPlan seems pretty bad. I see you've stashed it in there and invented a new memory context to stop leaks into the cache memory. Since I'm not overly familiar with these structures, I'm trying to imagine why you made that choice and the best I can come up with was that it was the most convenient thing you had to hand inside CheckCachedPlan(). I don't really have any great ideas right now on how to make this better. I wonder if GetCachedPlan() should be changed to return some struct that wraps up the CachedPlan with some sort of executor prep info struct that we can stash the list of PartitionPruneResults in, and perhaps something else one day. Some lesser important stuff that I think could be done better. * Are you also able to put meaningful comments on the PartitionPruneResult struct in execnodes.h? * In create_append_plan() and create_merge_append_plan() you have the same code to set the part_prune_index. Why not just move all that code into make_partition_pruneinfo() and have make_partition_pruneinfo() return the index and append to the PlannerInfo.partPruneInfos List? * Why not forboth() here? i = 0; foreach(stmtlist_item, portal->stmts) { PlannedStmt *pstmt = lfirst_node(PlannedStmt, stmtlist_item); PartitionPruneResult *part_prune_result = part_prune_results ? list_nth(part_prune_results, i) : NULL; i++; * It would be good if ReleaseExecutorLocks() already knew the RTIs that were locked. Maybe the executor prep info struct I mentioned above could also store the RTIs that have been locked already and allow ReleaseExecutorLocks() to just iterate over those to release the locks. David
On Thu, Apr 7, 2022 at 9:41 PM David Rowley <dgrowleyml@gmail.com> wrote: > On Thu, 7 Apr 2022 at 20:28, Amit Langote <amitlangote09@gmail.com> wrote: > > Here's an updated version. In Particular, I removed > > part_prune_results list from PortalData, in favor of anything that > > needs to look at the list can instead get it from the CachedPlan > > (PortalData.cplan). This makes things better in 2 ways: > > Thanks for making those changes. > > I'm not overly familiar with the data structures we use for planning > around plans between the planner and executor, but storing the pruning > results in CachedPlan seems pretty bad. I see you've stashed it in > there and invented a new memory context to stop leaks into the cache > memory. > > Since I'm not overly familiar with these structures, I'm trying to > imagine why you made that choice and the best I can come up with was > that it was the most convenient thing you had to hand inside > CheckCachedPlan(). Yeah, it's that way because it felt convenient, though I have wondered if a simpler scheme that doesn't require any changes to the CachedPlan data structure might be better after all. Your pointing it out has made me think a bit harder on that. > I don't really have any great ideas right now on how to make this > better. I wonder if GetCachedPlan() should be changed to return some > struct that wraps up the CachedPlan with some sort of executor prep > info struct that we can stash the list of PartitionPruneResults in, > and perhaps something else one day. I think what might be better to do now is just add an output List parameter to GetCachedPlan() to add the PartitionPruneResult node to instead of stashing them into CachedPlan as now. IMHO, we should leave inventing a new generic struct to the next project that will make it necessary to return more information from GetCachedPlan() to its users. I find it hard to convincingly describe what the new generic struct really is if we invent it *now*, when it's going to carry a single list whose purpose is pretty narrow. So, I've implemented this by making the callers of GetCachedPlan() pass a list to add the PartitionPruneResults that may be produced. Most callers can put that into the Portal for passing that to other modules, so I have reinstated PortalData.part_prune_results. As for its memory management, the list and the PartitionPruneResults therein will be allocated in a context that holds the Portal itself. > Some lesser important stuff that I think could be done better. > > * Are you also able to put meaningful comments on the > PartitionPruneResult struct in execnodes.h? > > * In create_append_plan() and create_merge_append_plan() you have the > same code to set the part_prune_index. Why not just move all that code > into make_partition_pruneinfo() and have make_partition_pruneinfo() > return the index and append to the PlannerInfo.partPruneInfos List? That sounds better, so done. > * Why not forboth() here? > > i = 0; > foreach(stmtlist_item, portal->stmts) > { > PlannedStmt *pstmt = lfirst_node(PlannedStmt, stmtlist_item); > PartitionPruneResult *part_prune_result = part_prune_results ? > list_nth(part_prune_results, i) : > NULL; > > i++; Because the PartitionPruneResult list may not always be available. To wit, it's only available when it is GetCachedPlan() that gave the portal its plan. I know this is a bit ugly, but it seems better than fixing all users of Portal to build a dummy list, not that it is totally avoidable even in the current implementation. 
> * It would be good if ReleaseExecutorLocks() already knew the RTIs > that were locked. Maybe the executor prep info struct I mentioned > above could also store the RTIs that have been locked already and > allow ReleaseExecutorLocks() to just iterate over those to release the > locks. Rewrote this such that ReleaseExecutorLocks() just receives a list of per-PlannedStmt bitmapsets containing the RT indexes of only the locked entries in that plan. Attached updated patch with these changes. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
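For example, the release side could then be as simple as the following sketch, where the list of per-statement locked-RTI bitmapsets is the new input (variable names here are illustrative, not necessarily the patch's):

    static void
    ReleaseExecutorLocks(List *stmt_list, List *lockedRelids_per_stmt)
    {
        ListCell   *lc1,
                   *lc2;

        forboth(lc1, stmt_list, lc2, lockedRelids_per_stmt)
        {
            PlannedStmt *plannedstmt = lfirst_node(PlannedStmt, lc1);
            Bitmapset   *lockedRelids = (Bitmapset *) lfirst(lc2);
            int          rti = -1;

            /* Unlock exactly the relations that AcquireExecutorLocks() locked. */
            while ((rti = bms_next_member(lockedRelids, rti)) >= 0)
            {
                RangeTblEntry *rte = rt_fetch(rti, plannedstmt->rtable);

                UnlockRelationOid(rte->relid, rte->rellockmode);
            }
        }
    }

Compared with re-walking the plan trees, this visits only the RT indexes that were actually locked.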
On Fri, 8 Apr 2022 at 17:49, Amit Langote <amitlangote09@gmail.com> wrote: > Attached updated patch with these changes. Thanks for making the changes. I started looking over this patch but really feel like it needs quite a few more iterations of what we've just been doing to get it into proper committable shape. There seems to be only about 40 mins to go before the freeze, so it seems very unrealistic that it could be made to work. I started trying to take a serious look at it this evening, but I feel like I just failed to get into it deep enough to make any meaningful improvements. I'd need more time to study the problem before I could build up a proper opinion on how exactly I think it should work. Anyway. I've attached a small patch that's just a few things I adjusted or questions while reading over your v13 patch. Some of these are just me questioning your code (See XXX comments) and some I think are improvements. Feel free to take the hunks that you see fit and drop anything you don't. David
Attachment
Hi David, On Fri, Apr 8, 2022 at 8:16 PM David Rowley <dgrowleyml@gmail.com> wrote: > On Fri, 8 Apr 2022 at 17:49, Amit Langote <amitlangote09@gmail.com> wrote: > > Attached updated patch with these changes. > Thanks for making the changes. I started looking over this patch but > really feel like it needs quite a few more iterations of what we've > just been doing to get it into proper committable shape. There seems > to be only about 40 mins to go before the freeze, so it seems very > unrealistic that it could be made to work. Yeah, totally understandable. > I started trying to take a serious look at it this evening, but I feel > like I just failed to get into it deep enough to make any meaningful > improvements. I'd need more time to study the problem before I could > build up a proper opinion on how exactly I think it should work. > > Anyway. I've attached a small patch that's just a few things I > adjusted or questions while reading over your v13 patch. Some of > these are just me questioning your code (See XXX comments) and some I > think are improvements. Feel free to take the hunks that you see fit > and drop anything you don't. Thanks a lot for compiling those. Most looked fine changes to me except a couple of typos, so I've adopted those into the attached new version, even though I know it's too late to try to apply it. Re the XXX comments: + /* XXX why would pprune->rti_map[i] ever be zero here??? */ Yeah, no there can't be, was perhaps being overly paraioid. + * XXX is it worth doing a bms_copy() on glob->minLockRelids if + * glob->containsInitialPruning is true?. I'm slighly worried that the + * Bitmapset could have a very long empty tail resulting in excessive + * looping during AcquireExecutorLocks(). + */ I guess I trust your instincts about bitmapset operation efficiency and what you've written here makes sense. It's typical for leaf partitions to have been appended toward the tail end of rtable and I'd imagine their indexes would be in the tail words of minLockRelids. If copying the bitmapset removes those useless words, I don't see why we shouldn't do that. So added: + /* + * It seems worth doing a bms_copy() on glob->minLockRelids if we deleted + * bit from it just above to prevent empty tail bits resulting in + * inefficient looping during AcquireExecutorLocks(). + */ + if (glob->containsInitialPruning) + glob->minLockRelids = bms_copy(glob->minLockRelids) Not 100% about the comment I wrote. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Fri, Apr 8, 2022 at 8:45 PM Amit Langote <amitlangote09@gmail.com> wrote: > Most looked fine changes to me except a couple of typos, so I've > adopted those into the attached new version, even though I know it's > too late to try to apply it. > > + * XXX is it worth doing a bms_copy() on glob->minLockRelids if > + * glob->containsInitialPruning is true?. I'm slighly worried that the > + * Bitmapset could have a very long empty tail resulting in excessive > + * looping during AcquireExecutorLocks(). > + */ > > I guess I trust your instincts about bitmapset operation efficiency > and what you've written here makes sense. It's typical for leaf > partitions to have been appended toward the tail end of rtable and I'd > imagine their indexes would be in the tail words of minLockRelids. If > copying the bitmapset removes those useless words, I don't see why we > shouldn't do that. So added: > > + /* > + * It seems worth doing a bms_copy() on glob->minLockRelids if we deleted > + * bit from it just above to prevent empty tail bits resulting in > + * inefficient looping during AcquireExecutorLocks(). > + */ > + if (glob->containsInitialPruning) > + glob->minLockRelids = bms_copy(glob->minLockRelids) > > Not 100% about the comment I wrote. And the quoted code change missed a semicolon in the v14 that I hurriedly sent on Friday. (Had apparently forgotten to `git add` the hunk to fix that). Sending v15 that fixes that to keep the cfbot green for now. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Sun, Apr 10, 2022 at 8:05 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Fri, Apr 8, 2022 at 8:45 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Most looked fine changes to me except a couple of typos, so I've
> adopted those into the attached new version, even though I know it's
> too late to try to apply it.
>
> + * XXX is it worth doing a bms_copy() on glob->minLockRelids if
> + * glob->containsInitialPruning is true?. I'm slighly worried that the
> + * Bitmapset could have a very long empty tail resulting in excessive
> + * looping during AcquireExecutorLocks().
> + */
>
> I guess I trust your instincts about bitmapset operation efficiency
> and what you've written here makes sense. It's typical for leaf
> partitions to have been appended toward the tail end of rtable and I'd
> imagine their indexes would be in the tail words of minLockRelids. If
> copying the bitmapset removes those useless words, I don't see why we
> shouldn't do that. So added:
>
> + /*
> + * It seems worth doing a bms_copy() on glob->minLockRelids if we deleted
> + * bit from it just above to prevent empty tail bits resulting in
> + * inefficient looping during AcquireExecutorLocks().
> + */
> + if (glob->containsInitialPruning)
> + glob->minLockRelids = bms_copy(glob->minLockRelids)
>
> Not 100% about the comment I wrote.
And the quoted code change missed a semicolon in the v14 that I
hurriedly sent on Friday. (Had apparently forgotten to `git add` the
hunk to fix that).
Sending v15 that fixes that to keep the cfbot green for now.
--
Amit Langote
EDB: http://www.enterprisedb.com
Hi,
+ /* RT index of the partitione table. */
partitione -> partitioned
Cheers
On Mon, Apr 11, 2022 at 12:53 PM Zhihong Yu <zyu@yugabyte.com> wrote: > On Sun, Apr 10, 2022 at 8:05 PM Amit Langote <amitlangote09@gmail.com> wrote: >> Sending v15 that fixes that to keep the cfbot green for now. > > Hi, > > + /* RT index of the partitione table. */ > > partitione -> partitioned Thanks, fixed. Also, I broke this into patches: 0001 contains the mechanical changes of moving PartitionPruneInfo out of Append/MergeAppend into a list in PlannedStmt. 0002 is the main patch to "Optimize AcquireExecutorLocks() by locking only unpruned partitions". -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Fri, May 27, 2022 at 1:10 AM Amit Langote <amitlangote09@gmail.com> wrote:
On Mon, Apr 11, 2022 at 12:53 PM Zhihong Yu <zyu@yugabyte.com> wrote:
> On Sun, Apr 10, 2022 at 8:05 PM Amit Langote <amitlangote09@gmail.com> wrote:
>> Sending v15 that fixes that to keep the cfbot green for now.
>
> Hi,
>
> + /* RT index of the partitione table. */
>
> partitione -> partitioned
Thanks, fixed.
Also, I broke this into patches:
0001 contains the mechanical changes of moving PartitionPruneInfo out
of Append/MergeAppend into a list in PlannedStmt.
0002 is the main patch to "Optimize AcquireExecutorLocks() by locking
only unpruned partitions".
--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
Hi,
In the description:
PartitionPruneResult, made available along with the PlannedStmt by the
I think the second `made available` is redundant (can be omitted).
+ * Initial pruning is performed here if needed (unless it has already been done
+ * by ExecDoInitialPruning()), and in that case only the surviving subplans'
I wonder if there is a typo above - I don't find ExecDoInitialPruning either in PG codebase or in the patches (except for this in the comment).
I think it should be ExecutorDoInitialPruning.
+ * bit from it just above to prevent empty tail bits resulting in
I searched the code base but didn't find any mention of `empty tail bit`. Do you mind explaining a bit about it?
Cheers
On Fri, May 27, 2022 at 1:09 AM Amit Langote <amitlangote09@gmail.com> wrote: > 0001 contains the mechanical changes of moving PartitionPruneInfo out > of Append/MergeAppend into a list in PlannedStmt. > > 0002 is the main patch to "Optimize AcquireExecutorLocks() by locking > only unpruned partitions". This patchset will need to be rebased over 835d476fd21; looks like just a cosmetic change. --Jacob
On Wed, Jul 6, 2022 at 2:43 AM Jacob Champion <jchampion@timescale.com> wrote: > On Fri, May 27, 2022 at 1:09 AM Amit Langote <amitlangote09@gmail.com> wrote: > > 0001 contains the mechanical changes of moving PartitionPruneInfo out > > of Append/MergeAppend into a list in PlannedStmt. > > > > 0002 is the main patch to "Optimize AcquireExecutorLocks() by locking > > only unpruned partitions". > > This patchset will need to be rebased over 835d476fd21; looks like > just a cosmetic change. Thanks for the heads up. Rebased and also fixed per comments given by Zhihong Yu on May 28. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
Rebased over 964d01ae90c. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Wed, Jul 13, 2022 at 3:40 PM Amit Langote <amitlangote09@gmail.com> wrote: > Rebased over 964d01ae90c. Sorry, left some pointless hunks in there while rebasing. Fixed in the attached. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Wed, Jul 13, 2022 at 4:03 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Wed, Jul 13, 2022 at 3:40 PM Amit Langote <amitlangote09@gmail.com> wrote: > > Rebased over 964d01ae90c. > > Sorry, left some pointless hunks in there while rebasing. Fixed in > the attached. Needed to be rebased again, over 2d04277121f this time. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Tue, Jul 26, 2022 at 11:01 PM Amit Langote <amitlangote09@gmail.com> wrote: > Needed to be rebased again, over 2d04277121f this time. 0001 adds es_part_prune_result but does not use it, so maybe the introduction of that field should be deferred until it's needed for something. I wonder whether it's really necessary to add the PartitionPruneInfo objects to a list in PlannerInfo first and then roll them up into PlannerGlobal later. I know we do that for range table entries, but I've never quite understood why we do it that way instead of creating a flat range table in PlannerGlobal from the start. And so by extension I wonder whether this table couldn't be flat from the start also. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jul 26, 2022 at 11:01 PM Amit Langote <amitlangote09@gmail.com> wrote: > > Needed to be rebased again, over 2d04277121f this time. Thanks for looking. > 0001 adds es_part_prune_result but does not use it, so maybe the > introduction of that field should be deferred until it's needed for > something. Oops, looks like a mistake when breaking the patch. Will move that bit to 0002. > I wonder whether it's really necessary to added the PartitionPruneInfo > objects to a list in PlannerInfo first and then roll them up into > PlannerGlobal later. I know we do that for range table entries, but > I've never quite understood why we do it that way instead of creating > a flat range table in PlannerGlobal from the start. And so by > extension I wonder whether this table couldn't be flat from the start > also. Tom may want to correct me but my understanding of why the planner waits till the end of planning to start populating the PlannerGlobal range table is that it is not until then that we know which subqueries will be scanned by the final plan tree, so also whose range table entries will be included in the range table passed to the executor. I can see that subquery pull-up causes a pulled-up subquery's range table entries to be added into the parent's query's and all its nodes changed using OffsetVarNodes() to refer to the new RT indexes. But for subqueries that are not pulled up, their subplans' nodes (present in PlannerGlboal.subplans) would still refer to the original RT indexes (per range table in the corresponding PlannerGlobal.subroot), which must be fixed and the end of planning is the time to do so. Or maybe that could be done when build_subplan() creates a subplan and adds it to PlannerGlobal.subplans, but for some reason it's not? -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Amit Langote <amitlangote09@gmail.com> writes: > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote: >> I wonder whether it's really necessary to added the PartitionPruneInfo >> objects to a list in PlannerInfo first and then roll them up into >> PlannerGlobal later. I know we do that for range table entries, but >> I've never quite understood why we do it that way instead of creating >> a flat range table in PlannerGlobal from the start. And so by >> extension I wonder whether this table couldn't be flat from the start >> also. > Tom may want to correct me but my understanding of why the planner > waits till the end of planning to start populating the PlannerGlobal > range table is that it is not until then that we know which subqueries > will be scanned by the final plan tree, so also whose range table > entries will be included in the range table passed to the executor. It would not be profitable to flatten the range table before we've done remove_useless_joins. We'd end up with useless entries from subqueries that ultimately aren't there. We could perhaps do it after we finish that phase, but I don't really see the point: it wouldn't be better than what we do now, just the same work at a different time. regards, tom lane
On Fri, Jul 29, 2022 at 12:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > It would not be profitable to flatten the range table before we've > done remove_useless_joins. We'd end up with useless entries from > subqueries that ultimately aren't there. We could perhaps do it > after we finish that phase, but I don't really see the point: it > wouldn't be better than what we do now, just the same work at a > different time. That's not quite my question, though. Why do we ever build a non-flat range table in the first place? Like, instead of assigning indexes relative to the current subquery level, why not just assign them relative to the whole query from the start? It can't really be that we've done it this way because of remove_useless_joins(), because we've been building separate range tables and later flattening them for longer than join removal has existed as a feature. What bugs me is that it's very much not free. By building a bunch of separate range tables and combining them later, we generate extra work: we have to go back and adjust RT indexes after-the-fact. We pay that overhead for every query, not just the ones that end up with some unused entries in the range table. And why would it matter if we did end up with some useless entries in the range table, anyway? If there's some semantic difference, we could add a flag to mark those entries as needing to be ignored, which seems way better than crawling all over the whole tree adjusting RTIs everywhere. I don't really expect that we're ever going to change this -- and certainly not on this thread. The idea of running around and replacing RT indexes all over the tree is deeply embedded in the system. But are we really sure we want to add a second kind of index that we have to run around and adjust at the same time? If we are, so be it, I guess. It just looks really ugly and unnecessary to me. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > That's not quite my question, though. Why do we ever build a non-flat > range table in the first place? Like, instead of assigning indexes > relative to the current subquery level, why not just assign them > relative to the whole query from the start? We could probably make that work, but I'm skeptical that it would really be an improvement overall, for a couple of reasons. (1) The need for merge-rangetables-and-renumber-Vars logic doesn't go away. It just moves from setrefs.c to the rewriter, which would have to do it when expanding views. This would be a net loss performance-wise, I think, because setrefs.c can do it as part of a parsetree scan that it has to perform anyway for other housekeeping reasons; but the rewriter would need a brand new pass over the tree. Admittedly that pass would only happen for view replacement, but it's still not open-and-shut that there'd be a performance win. (2) The need for varlevelsup and similar fields doesn't go away, I think, because we need those for semantic purposes such as discovering the query level that aggregates are associated with. That means that subquery flattening still has to make a pass over the tree to touch every Var's varlevelsup; so not having to adjust varno at the same time would save little. I'm not sure whether I think it's a net plus or net minus that varno would become effectively independent of varlevelsup. It'd be different from the way we think of them now, for sure, and I think it'd take awhile to flush out bugs arising from such a redefinition. > I don't really expect that we're ever going to change this -- and > certainly not on this thread. The idea of running around and replacing > RT indexes all over the tree is deeply embedded in the system. But are > we really sure we want to add a second kind of index that we have to > run around and adjust at the same time? You probably want to avert your eyes from [1], then ;-). Although I'm far from convinced that the cross-list index fields currently proposed there are actually necessary; the cost to adjust them during rangetable merging could outweigh any benefit. regards, tom lane [1] https://www.postgresql.org/message-id/flat/CA+HiwqGjJDmUhDSfv-U2qhKJjt9ST7Xh9JXC_irsAQ1TAUsJYg@mail.gmail.com
On Fri, Jul 29, 2022 at 11:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > We could probably make that work, but I'm skeptical that it would > really be an improvement overall, for a couple of reasons. > > (1) The need for merge-rangetables-and-renumber-Vars logic doesn't > go away. It just moves from setrefs.c to the rewriter, which would > have to do it when expanding views. This would be a net loss > performance-wise, I think, because setrefs.c can do it as part of a > parsetree scan that it has to perform anyway for other housekeeping > reasons; but the rewriter would need a brand new pass over the tree. > Admittedly that pass would only happen for view replacement, but > it's still not open-and-shut that there'd be a performance win. > > (2) The need for varlevelsup and similar fields doesn't go away, > I think, because we need those for semantic purposes such as > discovering the query level that aggregates are associated with. > That means that subquery flattening still has to make a pass over > the tree to touch every Var's varlevelsup; so not having to adjust > varno at the same time would save little. > > I'm not sure whether I think it's a net plus or net minus that > varno would become effectively independent of varlevelsup. > It'd be different from the way we think of them now, for sure, > and I think it'd take awhile to flush out bugs arising from such > a redefinition. Interesting. Thanks for your thoughts. I guess it's not as clear-cut as I thought, but I still can't help feeling like we're doing an awful lot of expensive rearrangement at the end of query planning. I kind of wonder whether varlevelsup is the wrong idea. Like, suppose we instead handed out subquery identifiers serially, sort of like what we do with SubTransactionId values. Then instead of testing whether varlevelsup>0 you test whether varsubqueryid==mysubqueryid. If you flatten a query into its parent, you still need to adjust every var that refers to the dead subquery, but you don't need to adjust vars that refer to subqueries underneath it. Their level changes, but their identity doesn't. Maybe that doesn't really help that much, but it's always struck me as a little unfortunate that we basically test whether a var is equal by testing whether the varno and varlevelsup are equal. That only works if you assume that you can never end up comparing two vars from thoroughly unrelated parts of the tree, such that the subquery one level up from one might be different from the subquery one level up from the other. > > I don't really expect that we're ever going to change this -- and > > certainly not on this thread. The idea of running around and replacing > > RT indexes all over the tree is deeply embedded in the system. But are > > we really sure we want to add a second kind of index that we have to > > run around and adjust at the same time? > > You probably want to avert your eyes from [1], then ;-). Although > I'm far from convinced that the cross-list index fields currently > proposed there are actually necessary; the cost to adjust them > during rangetable merging could outweigh any benefit. I really like the idea of that patch overall, actually; I think permissions checking is a good example of something that shouldn't require walking the whole query tree but currently does. And actually, I think the same thing is true here: we shouldn't need to walk the whole query tree to find the pruning information, but right now we do. 
I'm just uncertain whether what Amit has implemented is the least-annoying way to go about it... any thoughts on that, specifically as it pertains to this patch? -- Robert Haas EDB: http://www.enterprisedb.com
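For illustration, the serial-subquery-identifier idea might look roughly
like the sketch below; the names (HypotheticalVar, varsubqueryid,
mysubqueryid) are invented for this example and exist nowhere in
PostgreSQL:

    #include <stdbool.h>

    /*
     * Hypothetical Var layout: each subquery gets a serially assigned
     * identifier, so flattening a subquery into its parent only needs to
     * touch Vars that referred to the now-dead subquery; Vars referring
     * to deeper subqueries keep their identity.
     */
    typedef struct HypotheticalVar
    {
        int         varno;          /* range-table index, as today */
        int         varsubqueryid;  /* serial id of the owning subquery */
    } HypotheticalVar;

    /* Would replace the "varlevelsup == 0" test while processing a subquery. */
    static inline bool
    var_is_local(const HypotheticalVar *var, int mysubqueryid)
    {
        return var->varsubqueryid == mysubqueryid;
    }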
Robert Haas <robertmhaas@gmail.com> writes: > ... it's > always struck me as a little unfortunate that we basically test > whether a var is equal by testing whether the varno and varlevelsup > are equal. That only works if you assume that you can never end up > comparing two vars from thoroughly unrelated parts of the tree, such > that the subquery one level up from one might be different from the > subquery one level up from the other. Yeah, that's always bothered me a little as well. I've yet to see a case where it causes a problem in practice. But I think that if, say, we were to try to do any sort of cross-query-level optimization, then the ambiguity could rise up to bite us. That might be a situation where a flat rangetable would be worth the trouble. > I'm just uncertain whether what Amit has implemented is the > least-annoying way to go about it... any thoughts on that, > specifically as it pertains to this patch? I haven't looked at this patch at all. I'll try to make some time for it, but probably not today. regards, tom lane
On Fri, Jul 29, 2022 at 12:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I'm just uncertain whether what Amit has implemented is the > > least-annoying way to go about it... any thoughts on that, > > specifically as it pertains to this patch? > > I haven't looked at this patch at all. I'll try to make some > time for it, but probably not today. OK, thanks. The preliminary patch I'm talking about here is pretty short, so you could probably look at that part of it, at least, in some relatively small amount of time. And I think it's also in pretty reasonable shape apart from this issue. But, as usual, there's the question of how well one can evaluate a preliminary patch without reviewing the full patch in detail. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jul 29, 2022 at 1:20 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > > 0001 adds es_part_prune_result but does not use it, so maybe the > > introduction of that field should be deferred until it's needed for > > something. > > Oops, looks like a mistake when breaking the patch. Will move that bit to 0002. Fixed that and also noticed that I had defined PartitionPruneResult in the wrong header (execnodes.h). That led to PartitionPruneResult nodes not being able to be written and read, because src/backend/nodes/gen_node_support.pl doesn't create _out* and _read* routines for the nodes defined in execnodes.h. I moved its definition to plannodes.h, even though it is not actually the planner that instantiates those; no other include/nodes header sounds better. One more thing I realized is that Bitmapsets added to the List PartitionPruneResult.valid_subplan_offs_list are not actually read/write-able. That's a problem that I also faced in [1], so I proposed a patch there to make Bitmapset a read/write-able Node and mark (only) the Bitmapsets that are added into read/write-able node trees with the corresponding NodeTag. I'm including that patch here as well (0002) for the main patch to work (pass -DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make sense to discuss it in its own thread? -- Thanks, Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/CA%2BHiwqH80qX1ZLx3HyHmBrOzLQeuKuGx6FzGep0F_9zw9L4PAA%40mail.gmail.com
On Wed, Oct 12, 2022 at 4:36 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Fri, Jul 29, 2022 at 1:20 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > 0001 adds es_part_prune_result but does not use it, so maybe the > > > introduction of that field should be deferred until it's needed for > > > something. > > > > Oops, looks like a mistake when breaking the patch. Will move that bit to 0002. > > Fixed that and also noticed that I had defined PartitionPruneResult in > the wrong header (execnodes.h). That led to PartitionPruneResult > nodes not being able to be written and read, because > src/backend/nodes/gen_node_support.pl doesn't create _out* and _read* > routines for the nodes defined in execnodes.h. I moved its definition > to plannodes.h, even though it is not actually the planner that > instantiates those; no other include/nodes header sounds better. > > One more thing I realized is that Bitmapsets added to the List > PartitionPruneResult.valid_subplan_offs_list are not actually > read/write-able. That's a problem that I also faced in [1], so I > proposed a patch there to make Bitmapset a read/write-able Node and > mark (only) the Bitmapsets that are added into read/write-able node > trees with the corresponding NodeTag. I'm including that patch here > as well (0002) for the main patch to work (pass > -DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make sense > to discuss it in its own thread? Had second thoughts on the use of List of Bitmapsets for this, such that the make-Bitmapset-Nodes patch is no longer needed. I had defined PartitionPruneResult such that it stood for the results of pruning for all PartitionPruneInfos contained in PlannedStmt.partPruneInfos (covering all Append/MergeAppend nodes that can use partition pruning in a given plan). So, it had a List of Bitmapset. I think it's perhaps better for PartitionPruneResult to cover only one PartitionPruneInfo and thus need only a Bitmapset and not a List thereof, which I have implemented in the attached updated patch 0002. So, instead of needing to pass around a PartitionPruneResult with each PlannedStmt, this now passes a List of PartitionPruneResult with an entry for each in PlannedStmt.partPruneInfos. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
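As a rough sketch, such a per-PartitionPruneInfo result node might look
like this; the field name follows the wording above, but the layout in
the posted patch may well differ:

    #include "nodes/nodes.h"
    #include "nodes/bitmapset.h"

    /*
     * One result per PartitionPruneInfo: just the set of sub-plan offsets
     * that survived "initial" pruning, instead of a List of Bitmapsets.
     */
    typedef struct PartitionPruneResult
    {
        NodeTag     type;
        Bitmapset  *valid_subplan_offs;     /* surviving sub-plan offsets */
    } PartitionPruneResult;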
On Mon, Oct 17, 2022 at 6:29 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Wed, Oct 12, 2022 at 4:36 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Fri, Jul 29, 2022 at 1:20 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > 0001 adds es_part_prune_result but does not use it, so maybe the > > > > introduction of that field should be deferred until it's needed for > > > > something. > > > > > > Oops, looks like a mistake when breaking the patch. Will move that bit to 0002. > > > > Fixed that and also noticed that I had defined PartitionPruneResult in > > the wrong header (execnodes.h). That led to PartitionPruneResult > > nodes not being able to be written and read, because > > src/backend/nodes/gen_node_support.pl doesn't create _out* and _read* > > routines for the nodes defined in execnodes.h. I moved its definition > > to plannodes.h, even though it is not actually the planner that > > instantiates those; no other include/nodes header sounds better. > > > > One more thing I realized is that Bitmapsets added to the List > > PartitionPruneResult.valid_subplan_offs_list are not actually > > read/write-able. That's a problem that I also faced in [1], so I > > proposed a patch there to make Bitmapset a read/write-able Node and > > mark (only) the Bitmapsets that are added into read/write-able node > > trees with the corresponding NodeTag. I'm including that patch here > > as well (0002) for the main patch to work (pass > > -DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make sense > > to discuss it in its own thread? > > Had second thoughts on the use of List of Bitmapsets for this, such > that the make-Bitmapset-Nodes patch is no longer needed. > > I had defined PartitionPruneResult such that it stood for the results > of pruning for all PartitionPruneInfos contained in > PlannedStmt.partPruneInfos (covering all Append/MergeAppend nodes that > can use partition pruning in a given plan). So, it had a List of > Bitmapset. I think it's perhaps better for PartitionPruneResult to > cover only one PartitionPruneInfo and thus need only a Bitmapset and > not a List thereof, which I have implemented in the attached updated > patch 0002. So, instead of needing to pass around a > PartitionPruneResult with each PlannedStmt, this now passes a List of > PartitionPruneResult with an entry for each in > PlannedStmt.partPruneInfos. Rebased over 3b2db22fe. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
On Thu, Oct 27, 2022 at 11:41 AM Amit Langote <amitlangote09@gmail.com> wrote: > On Mon, Oct 17, 2022 at 6:29 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Wed, Oct 12, 2022 at 4:36 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > On Fri, Jul 29, 2022 at 1:20 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > > 0001 adds es_part_prune_result but does not use it, so maybe the > > > > > introduction of that field should be deferred until it's needed for > > > > > something. > > > > > > > > Oops, looks like a mistake when breaking the patch. Will move that bit to 0002. > > > > > > Fixed that and also noticed that I had defined PartitionPruneResult in > > > the wrong header (execnodes.h). That led to PartitionPruneResult > > > nodes not being able to be written and read, because > > > src/backend/nodes/gen_node_support.pl doesn't create _out* and _read* > > > routines for the nodes defined in execnodes.h. I moved its definition > > > to plannodes.h, even though it is not actually the planner that > > > instantiates those; no other include/nodes header sounds better. > > > > > > One more thing I realized is that Bitmapsets added to the List > > > PartitionPruneResult.valid_subplan_offs_list are not actually > > > read/write-able. That's a problem that I also faced in [1], so I > > > proposed a patch there to make Bitmapset a read/write-able Node and > > > mark (only) the Bitmapsets that are added into read/write-able node > > > trees with the corresponding NodeTag. I'm including that patch here > > > as well (0002) for the main patch to work (pass > > > -DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make sense > > > to discuss it in its own thread? > > > > Had second thoughts on the use of List of Bitmapsets for this, such > > that the make-Bitmapset-Nodes patch is no longer needed. > > > > I had defined PartitionPruneResult such that it stood for the results > > of pruning for all PartitionPruneInfos contained in > > PlannedStmt.partPruneInfos (covering all Append/MergeAppend nodes that > > can use partition pruning in a given plan). So, it had a List of > > Bitmapset. I think it's perhaps better for PartitionPruneResult to > > cover only one PartitionPruneInfo and thus need only a Bitmapset and > > not a List thereof, which I have implemented in the attached updated > > patch 0002. So, instead of needing to pass around a > > PartitionPruneResult with each PlannedStmt, this now passes a List of > > PartitionPruneResult with an entry for each in > > PlannedStmt.partPruneInfos. > > Rebased over 3b2db22fe. Updated 0002 to cope with AssertArg() being removed from the tree. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Looking at 0001, I wonder if we should have a crosscheck that a PartitionPruneInfo you got from following an index is indeed constructed for the relation that you think it is: previously, you were always sure that the prune struct is for this node, because you followed a pointer that was set up in the node itself. Now you only have an index, and you have to trust that the index is correct. I'm not sure how to implement this, or even if it's doable at all. Keeping the OID of the partitioned table in the PartitionPruneInfo struct is easy, but I don't know how to check it in ExecInitMergeAppend and ExecInitAppend. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Find a bug in a program, and fix it, and the program will work today. Show the program how to find and fix a bug, and the program will work forever" (Oliver Silfridge)
Hi Alvaro, Thanks for looking at this one. On Thu, Dec 1, 2022 at 3:12 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Looking at 0001, I wonder if we should have a crosscheck that a > PartitionPruneInfo you got from following an index is indeed constructed > for the relation that you think it is: previously, you were always sure > that the prune struct is for this node, because you followed a pointer > that was set up in the node itself. Now you only have an index, and you > have to trust that the index is correct. Yeah, a crosscheck sounds like a good idea. > I'm not sure how to implement this, or even if it's doable at all. > Keeping the OID of the partitioned table in the PartitionPruneInfo > struct is easy, but I don't know how to check it in ExecInitMergeAppend > and ExecInitAppend. Hmm, how about keeping the [Merge]Append's parent relation's RT index in the PartitionPruneInfo and passing it down to ExecInitPartitionPruning() from ExecInit[Merge]Append() for cross-checking? Both Append and MergeAppend already have a 'apprelids' field that we can save a copy of in the PartitionPruneInfo. Tried that in the attached delta patch. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
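A minimal sketch of the cross-check being described, assuming the
PartitionPruneInfo carries a copy of the owning node's 'apprelids' (the
helper name and error wording are illustrative):

    #include "postgres.h"
    #include "nodes/bitmapset.h"

    /*
     * Called from ExecInit[Merge]Append(): the PartitionPruneInfo fetched
     * by index must have been built for this very Append/MergeAppend node.
     */
    static void
    check_pruneinfo_matches_node(const Bitmapset *node_apprelids,
                                 const Bitmapset *pruneinfo_relids,
                                 int part_prune_index)
    {
        if (!bms_equal(node_apprelids, pruneinfo_relids))
            elog(ERROR, "wrong pruneinfo with index %d passed to node",
                 part_prune_index);
    }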
On 2022-Dec-01, Amit Langote wrote:
> Hmm, how about keeping the [Merge]Append's parent relation's RT index
> in the PartitionPruneInfo and passing it down to
> ExecInitPartitionPruning() from ExecInit[Merge]Append() for
> cross-checking? Both Append and MergeAppend already have a
> 'apprelids' field that we can save a copy of in the
> PartitionPruneInfo. Tried that in the attached delta patch.

Ah yeah, that sounds about like what I was thinking. I've merged that in
and pushed to github, which had a strange pg_upgrade failure on Windows
mentioning log files that were not captured by the CI tooling. So I
pushed another one trying to grab those files, in case it wasn't a
one-off failure. It's running now:
https://cirrus-ci.com/task/5857239638999040

If all goes well with this run, I'll get this 0001 pushed.

-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Investigación es lo que hago cuando no sé lo que estoy haciendo" (Wernher von Braun)
On Thu, Dec 1, 2022 at 8:21 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > On 2022-Dec-01, Amit Langote wrote: > > Hmm, how about keeping the [Merge]Append's parent relation's RT index > > in the PartitionPruneInfo and passing it down to > > ExecInitPartitionPruning() from ExecInit[Merge]Append() for > > cross-checking? Both Append and MergeAppend already have a > > 'apprelids' field that we can save a copy of in the > > PartitionPruneInfo. Tried that in the attached delta patch. > > Ah yeah, that sounds about what I was thinking. I've merged that in and > pushed to github, which had a strange pg_upgrade failure on Windows > mentioning log files that were not captured by the CI tooling. So I > pushed another one trying to grab those files, in case it wasn't an > one-off failure. It's running now: > https://cirrus-ci.com/task/5857239638999040 > > If all goes well with this run, I'll get this 0001 pushed. Thanks for pushing 0001. Rebased 0002 attached. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
On Thu, Dec 1, 2022 at 9:43 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Thu, Dec 1, 2022 at 8:21 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2022-Dec-01, Amit Langote wrote: > > > Hmm, how about keeping the [Merge]Append's parent relation's RT index > > > in the PartitionPruneInfo and passing it down to > > > ExecInitPartitionPruning() from ExecInit[Merge]Append() for > > > cross-checking? Both Append and MergeAppend already have a > > > 'apprelids' field that we can save a copy of in the > > > PartitionPruneInfo. Tried that in the attached delta patch. > > > > Ah yeah, that sounds about what I was thinking. I've merged that in and > > pushed to github, which had a strange pg_upgrade failure on Windows > > mentioning log files that were not captured by the CI tooling. So I > > pushed another one trying to grab those files, in case it wasn't an > > one-off failure. It's running now: > > https://cirrus-ci.com/task/5857239638999040 > > > > If all goes well with this run, I'll get this 0001 pushed. > > Thanks for pushing 0001. > > Rebased 0002 attached. Thought it might be good for PartitionPruneResult to also have root_parent_relids that matches with the corresponding PartitionPruneInfo. ExecInitPartitionPruning() does a sanity check that the root_parent_relids of a given pair of PartitionPrune{Info | Result} match. Posting the patch separately as the attached 0002, just in case you might think that the extra cross-checking would be an overkill. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
On Fri, Dec 2, 2022 at 7:40 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Thu, Dec 1, 2022 at 9:43 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Thu, Dec 1, 2022 at 8:21 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > On 2022-Dec-01, Amit Langote wrote: > > > > Hmm, how about keeping the [Merge]Append's parent relation's RT index > > > > in the PartitionPruneInfo and passing it down to > > > > ExecInitPartitionPruning() from ExecInit[Merge]Append() for > > > > cross-checking? Both Append and MergeAppend already have a > > > > 'apprelids' field that we can save a copy of in the > > > > PartitionPruneInfo. Tried that in the attached delta patch. > > > > > > Ah yeah, that sounds about what I was thinking. I've merged that in and > > > pushed to github, which had a strange pg_upgrade failure on Windows > > > mentioning log files that were not captured by the CI tooling. So I > > > pushed another one trying to grab those files, in case it wasn't an > > > one-off failure. It's running now: > > > https://cirrus-ci.com/task/5857239638999040 > > > > > > If all goes well with this run, I'll get this 0001 pushed. > > > > Thanks for pushing 0001. > > > > Rebased 0002 attached. > > Thought it might be good for PartitionPruneResult to also have > root_parent_relids that matches with the corresponding > PartitionPruneInfo. ExecInitPartitionPruning() does a sanity check > that the root_parent_relids of a given pair of PartitionPrune{Info | > Result} match. > > Posting the patch separately as the attached 0002, just in case you > might think that the extra cross-checking would be an overkill. Rebased over 92c4dafe1eed and fixed some factual mistakes in the comment above ExecutorDoInitialPruning(). -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
On Mon, Dec 5, 2022 at 12:00 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Fri, Dec 2, 2022 at 7:40 PM Amit Langote <amitlangote09@gmail.com> wrote: > > Thought it might be good for PartitionPruneResult to also have > > root_parent_relids that matches with the corresponding > > PartitionPruneInfo. ExecInitPartitionPruning() does a sanity check > > that the root_parent_relids of a given pair of PartitionPrune{Info | > > Result} match. > > > > Posting the patch separately as the attached 0002, just in case you > > might think that the extra cross-checking would be an overkill. > > Rebased over 92c4dafe1eed and fixed some factual mistakes in the > comment above ExecutorDoInitialPruning(). Sorry, I had forgotten to git-add hunks including some cosmetic changes in that one. Here's another version. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
I find the API of GetCachedPlans a little weird after this patch. I think it may be better to have it return a pointer of a new struct -- one that contains both the CachedPlan pointer and the list of pruning results. (As I understand, the sole caller that isn't interested in the pruning results, SPI_plan_get_cached_plan, can be explained by the fact that it knows there won't be any. So I don't think we need to worry about this case?) And I think you should make that struct also be the last argument of PortalDefineQuery, so you don't need the separate PortalStorePartitionPruneResults function -- because as far as I can tell, the callers that pass a non-NULL pointer there are the exactly same that later call PortalStorePartitionPruneResults. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "La primera ley de las demostraciones en vivo es: no trate de usar el sistema. Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)
Thanks for the review.

On Wed, Dec 7, 2022 at 4:00 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I find the API of GetCachedPlans a little weird after this patch. I
> think it may be better to have it return a pointer of a new struct --
> one that contains both the CachedPlan pointer and the list of pruning
> results. (As I understand, the sole caller that isn't interested in the
> pruning results, SPI_plan_get_cached_plan, can be explained by the fact
> that it knows there won't be any. So I don't think we need to worry
> about this case?)

David, in his Apr 7 reply on this thread, also seemed to suggest
something similar.

Hmm, I was / am not so sure if GetCachedPlan() should return something
that is not CachedPlan. An idea I had today was to replace the
part_prune_results_list output List parameter with, say,
QueryInitPruningResult, or something like that, and put the current list
into that struct. Was looking at QueryEnvironment to come up with
*that* name. Any thoughts?

> And I think you should make that struct also be the last argument of
> PortalDefineQuery, so you don't need the separate
> PortalStorePartitionPruneResults function -- because as far as I can
> tell, the callers that pass a non-NULL pointer there are the exactly
> same that later call PortalStorePartitionPruneResults.

Yes, it would be better to not need PortalStorePartitionPruneResults.

-- Thanks, Amit Langote EDB: http://www.enterprisedb.com
On 2022-Dec-09, Amit Langote wrote: > On Wed, Dec 7, 2022 at 4:00 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I find the API of GetCachedPlans a little weird after this patch. > David, in his Apr 7 reply on this thread, also sounded to suggest > something similar. > > Hmm, I was / am not so sure if GetCachedPlan() should return something > that is not CachedPlan. An idea I had today was to replace the > part_prune_results_list output List parameter with, say, > QueryInitPruningResult, or something like that and put the current > list into that struct. Was looking at QueryEnvironment to come up > with *that* name. Any thoughts? Remind me again why is part_prune_results_list not part of struct CachedPlan then? I tried to understand that based on comments upthread, but I was unable to find anything. (My first reaction to your above comment was "well, rename GetCachedPlan then, maybe to GetRunnablePlan", but then I'm wondering if CachedPlan is in any way a structure that must be "immutable" in the way parser output is. Looking at the comment at the top of plancache.c it appears to me that it isn't, but maybe I'm missing something.) -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "The Postgresql hackers have what I call a "NASA space shot" mentality. Quite refreshing in a world of "weekend drag racer" developers." (Scott Marlowe)
On Fri, Dec 9, 2022 at 6:52 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > On 2022-Dec-09, Amit Langote wrote: > > On Wed, Dec 7, 2022 at 4:00 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > I find the API of GetCachedPlans a little weird after this patch. > > > David, in his Apr 7 reply on this thread, also sounded to suggest > > something similar. > > > > Hmm, I was / am not so sure if GetCachedPlan() should return something > > that is not CachedPlan. An idea I had today was to replace the > > part_prune_results_list output List parameter with, say, > > QueryInitPruningResult, or something like that and put the current > > list into that struct. Was looking at QueryEnvironment to come up > > with *that* name. Any thoughts? > > Remind me again why is part_prune_results_list not part of struct > CachedPlan then? I tried to understand that based on comments upthread, > but I was unable to find anything. It used to be part of CachedPlan for a brief period of time (in patch v12 I posted in [1]), but David, in his reply to [1], said he wasn't so sure that it belonged there. > (My first reaction to your above comment was "well, rename GetCachedPlan > then, maybe to GetRunnablePlan", but then I'm wondering if CachedPlan is > in any way a structure that must be "immutable" in the way parser output > is. Looking at the comment at the top of plancache.c it appears to me > that it isn't, but maybe I'm missing something.) CachedPlan *is* supposed to be read-only per the comment above CachedPlanSource definition: * ...If we are using a generic * cached plan then it is meant to be re-used across multiple executions, so * callers must always treat CachedPlans as read-only. FYI, there was even an idea of putting a PartitionPruneResults for a given PlannedStmt into the PlannedStmt itself [2], but PlannedStmt is supposed to be read-only too [3]. Maybe we need some new overarching context when invoking plancache, if Portal can't already be it, whose struct can be passed to GetCachedPlan() to put the pruning results in? Perhaps, GetRunnablePlan() that you floated could be a wrapper for GetCachedPlan(), owning that new context. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/CA%2BHiwqH4qQ_YVROr7TY0jSCuGn0oHhH79_DswOdXWN5UnMCBtQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAApHDvp_DjVVkgSV24%2BUF7p_yKWeepgoo%2BW2SWLLhNmjwHTVYQ%40mail.gmail.com [3] https://www.postgresql.org/message-id/922566.1648784745%40sss.pgh.pa.us
On 2022-Dec-09, Amit Langote wrote: > On Fri, Dec 9, 2022 at 6:52 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Remind me again why is part_prune_results_list not part of struct > > CachedPlan then? I tried to understand that based on comments upthread, > > but I was unable to find anything. > > It used to be part of CachedPlan for a brief period of time (in patch > v12 I posted in [1]), but David, in his reply to [1], said he wasn't > so sure that it belonged there. I'm not sure I necessarily agree with that. I'll have a look at v12 to try and understand what was David so unhappy about. > > (My first reaction to your above comment was "well, rename GetCachedPlan > > then, maybe to GetRunnablePlan", but then I'm wondering if CachedPlan is > > in any way a structure that must be "immutable" in the way parser output > > is. Looking at the comment at the top of plancache.c it appears to me > > that it isn't, but maybe I'm missing something.) > > CachedPlan *is* supposed to be read-only per the comment above > CachedPlanSource definition: > > * ...If we are using a generic > * cached plan then it is meant to be re-used across multiple executions, so > * callers must always treat CachedPlans as read-only. I read that as implying that the part_prune_results_list must remain intact as long as no invalidations occur. Does part_prune_result_list really change as a result of something other than a sinval event? Keep in mind that if a sinval message that touches one of the relations in the plan arrives, then we'll discard it and generate it afresh. I don't see that the part_prune_results_list would change otherwise, but maybe I misunderstand? > FYI, there was even an idea of putting a PartitionPruneResults for a > given PlannedStmt into the PlannedStmt itself [2], but PlannedStmt is > supposed to be read-only too [3]. Hmm, I'm not familiar with PlannedStmt lifetime, but I'm definitely not betting that Tom is wrong about this. > Maybe we need some new overarching context when invoking plancache, if > Portal can't already be it, whose struct can be passed to > GetCachedPlan() to put the pruning results in? Perhaps, > GetRunnablePlan() that you floated could be a wrapper for > GetCachedPlan(), owning that new context. Perhaps that is a solution. I'm not sure. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Uno puede defenderse de los ataques; contra los elogios se esta indefenso"
On Fri, Dec 9, 2022 at 7:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > On 2022-Dec-09, Amit Langote wrote: > > On Fri, Dec 9, 2022 at 6:52 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > Remind me again why is part_prune_results_list not part of struct > > > CachedPlan then? I tried to understand that based on comments upthread, > > > but I was unable to find anything. > > > > > (My first reaction to your above comment was "well, rename GetCachedPlan > > > then, maybe to GetRunnablePlan", but then I'm wondering if CachedPlan is > > > in any way a structure that must be "immutable" in the way parser output > > > is. Looking at the comment at the top of plancache.c it appears to me > > > that it isn't, but maybe I'm missing something.) > > > > CachedPlan *is* supposed to be read-only per the comment above > > CachedPlanSource definition: > > > > * ...If we are using a generic > > * cached plan then it is meant to be re-used across multiple executions, so > > * callers must always treat CachedPlans as read-only. > > I read that as implying that the part_prune_results_list must remain > intact as long as no invalidations occur. Does part_prune_result_list > really change as a result of something other than a sinval event? > Keep in mind that if a sinval message that touches one of the relations > in the plan arrives, then we'll discard it and generate it afresh. I > don't see that the part_prune_results_list would change otherwise, but > maybe I misunderstand? Pruning will be done afresh on every fetch of a given cached plan when CheckCachedPlan() is called on it, so the part_prune_results_list part will be discarded and rebuilt as many times as the plan is executed. You'll find a description around CachedPlanSavePartitionPruneResults() that's in v12. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
On 2022-Dec-09, Amit Langote wrote: > Pruning will be done afresh on every fetch of a given cached plan when > CheckCachedPlan() is called on it, so the part_prune_results_list part > will be discarded and rebuilt as many times as the plan is executed. > You'll find a description around CachedPlanSavePartitionPruneResults() > that's in v12. I see. In that case, a separate container struct seems warranted. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Industry suffers from the managerial dogma that for the sake of stability and continuity, the company should be independent of the competence of individual employees." (E. Dijkstra)
On Fri, Dec 9, 2022 at 8:37 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Dec-09, Amit Langote wrote:
> > Pruning will be done afresh on every fetch of a given cached plan when
> > CheckCachedPlan() is called on it, so the part_prune_results_list part
> > will be discarded and rebuilt as many times as the plan is executed.
> > You'll find a description around CachedPlanSavePartitionPruneResults()
> > that's in v12.
>
> I see.
>
> In that case, a separate container struct seems warranted.

I thought about this today and played around with some container struct
ideas. Though, I started feeling like putting all the new logic being
added by this patch into plancache.c at the heart of GetCachedPlan() and
tweaking its API in kind of unintuitive ways may not have been such a
good idea to begin with. So I started thinking again about your
GetRunnablePlan() wrapper idea and thought maybe we could do something
with it.

Let's say we name it GetCachedPlanLockPartitions() and put the logic
that does initial pruning with the new ExecutorDoInitialPruning() in it,
instead of in the normal GetCachedPlan() path. Any callers that call
GetCachedPlan() instead call GetCachedPlanLockPartitions() with either
the List ** parameter as now or some container struct if that seems
better. Whether GetCachedPlanLockPartitions() needs to do anything other
than return the CachedPlan returned by GetCachedPlan() can be decided by
the latter setting, say, CachedPlan.has_unlocked_partitions. That will
be done by AcquireExecutorLocks() when it sees containsInitialPruning in
any of the PlannedStmts it sees, locking only the
PlannedStmt.minLockRelids set (which is all relations where no pruning
is needed!), leaving the partition locking to
GetCachedPlanLockPartitions(). If the CachedPlan is invalidated during
the partition locking phase, it calls GetCachedPlan() again; maybe some
refactoring is needed to avoid too much useless work in such cases.

Thoughts?

-- Thanks, Amit Langote EDB: http://www.enterprisedb.com
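A control-flow sketch of that wrapper, under the assumptions stated
above: CachedPlan.has_unlocked_partitions is the proposed new flag, and
CachedPlanLockPartitions() is a hypothetical helper that runs "initial"
pruning, locks the surviving partitions, and reports whether the plan
stayed valid while doing so:

    #include "postgres.h"
    #include "utils/plancache.h"

    /* Hypothetical helper, per the description above. */
    extern bool CachedPlanLockPartitions(CachedPlanSource *plansource,
                                         ParamListInfo boundParams,
                                         ResourceOwner owner,
                                         List **part_prune_results);

    CachedPlan *
    GetCachedPlanLockPartitions(CachedPlanSource *plansource,
                                ParamListInfo boundParams,
                                ResourceOwner owner,
                                QueryEnvironment *queryEnv,
                                List **part_prune_results)
    {
        for (;;)
        {
            CachedPlan *plan = GetCachedPlan(plansource, boundParams,
                                             owner, queryEnv);

            /* has_unlocked_partitions is the proposed new CachedPlan flag */
            if (!plan->has_unlocked_partitions ||
                CachedPlanLockPartitions(plansource, boundParams, owner,
                                         part_prune_results))
                return plan;

            /* Locking a partition invalidated the plan; replan and retry. */
            ReleaseCachedPlan(plan, owner);
        }
    }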
On 2022-Dec-12, Amit Langote wrote:
> I started feeling like putting all the new logic being added
> by this patch into plancache.c at the heart of GetCachedPlan() and
> tweaking its API in kind of unintuitive ways may not have been such a
> good idea to begin with. So I started thinking again about your
> GetRunnablePlan() wrapper idea and thought maybe we could do something
> with it. Let's say we name it GetCachedPlanLockPartitions() and put
> the logic that does initial pruning with the new
> ExecutorDoInitialPruning() in it, instead of in the normal
> GetCachedPlan() path. Any callers that call GetCachedPlan() instead
> call GetCachedPlanLockPartitions() with either the List ** parameter
> as now or some container struct if that seems better. Whether
> GetCachedPlanLockPartitions() needs to do anything other than return
> the CachedPlan returned by GetCachedPlan() can be decided by the
> latter setting, say, CachedPlan.has_unlocked_partitions. That will be
> done by AcquireExecutorLocks() when it sees containsInitialPruning in
> any of the PlannedStmts it sees, locking only the
> PlannedStmt.minLockRelids set (which is all relations where no pruning
> is needed!), leaving the partition locking to
> GetCachedPlanLockPartitions().

Hmm. This doesn't sound totally unreasonable, except to the point David
was making that perhaps we may want this container struct to accommodate
other things in the future than just the partition pruning results, so I
think its name (and that of the function that produces it) ought to be a
little more generic than that.

(I think this also answers your question on whether a List ** is better
than a container struct.)

-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Las cosas son buenas o malas segun las hace nuestra opinión" (Lisias)
On Tue, Dec 13, 2022 at 2:24 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > On 2022-Dec-12, Amit Langote wrote: > > I started feeling like putting all the new logic being added > > by this patch into plancache.c at the heart of GetCachedPlan() and > > tweaking its API in kind of unintuitive ways may not have been such a > > good idea to begin with. So I started thinking again about your > > GetRunnablePlan() wrapper idea and thought maybe we could do something > > with it. Let's say we name it GetCachedPlanLockPartitions() and put > > the logic that does initial pruning with the new > > ExecutorDoInitialPruning() in it, instead of in the normal > > GetCachedPlan() path. Any callers that call GetCachedPlan() instead > > call GetCachedPlanLockPartitions() with either the List ** parameter > > as now or some container struct if that seems better. Whether > > GetCachedPlanLockPartitions() needs to do anything other than return > > the CachedPlan returned by GetCachedPlan() can be decided by the > > latter setting, say, CachedPlan.has_unlocked_partitions. That will be > > done by AcquireExecutorLocks() when it sees containsInitialPrunnig in > > any of the PlannedStmts it sees, locking only the > > PlannedStmt.minLockRelids set (which is all relations where no pruning > > is needed!), leaving the partition locking to > > GetCachedPlanLockPartitions(). > > Hmm. This doesn't sound totally unreasonable, except to the point David > was making that perhaps we may want this container struct to accomodate > other things in the future than just the partition pruning results, so I > think its name (and that of the function that produces it) ought to be a > little more generic than that. > > (I think this also answers your question on whether a List ** is better > than a container struct.) OK, so here's a WIP attempt at that. I have moved the original functionality of GetCachedPlan() to GetCachedPlanInternal(), turning the former into a sort of controller as described shortly. The latter's CheckCachedPlan() part now only locks the "minimal" set of, non-prunable, relations, making a note of whether the plan contains any prunable subnodes and thus prunable relations whose locking is deferred to the caller, GetCachedPlan(). GetCachedPlan(), as a sort of controller as mentioned before, does the pruning if needed on the minimally valid plan returned by GetCachedPlanInternal(), locks the partitions that survive, and redoes the whole thing if the locking of partitions invalidates the plan. The pruning results are returned through the new output parameter of GetCachedPlan() of type CachedPlanExtra. I named it so after much consideration, because all the new logic that produces stuff to put into it is a part of the plancache module and has to do with manipulating a CachedPlan. (I had considered CachedPlanExecInfo to indicate that it contains information that is to be forwarded to the executor, though that just didn't seem to fit in plancache.h.) I have broken out a few things into a preparatory patch 0001. Mainly, it invents PlannedStmt.minLockRelids to replace the AcquireExecutorLocks()'s current loop over the range table to figure out the relations to lock. I also threw in a couple of pruning related non-functional changes in there to make it easier to read the 0002, which is the main patch. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
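As a rough sketch, the output-parameter struct described above might
look like the following; only the pruning-result list is shown, and the
actual layout in the posted patch may differ:

    #include "nodes/pg_list.h"

    /*
     * Execution-time information produced while validating a CachedPlan,
     * returned through a new output parameter of GetCachedPlan().
     */
    typedef struct CachedPlanExtra
    {
        List   *part_prune_results;     /* one PartitionPruneResult per
                                         * entry in PlannedStmt.partPruneInfos */
    } CachedPlanExtra;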
On Wed, Dec 14, 2022 at 5:35 PM Amit Langote <amitlangote09@gmail.com> wrote:
> I have moved the original functionality of GetCachedPlan() to
> GetCachedPlanInternal(), turning the former into a sort of controller
> as described shortly. The latter's CheckCachedPlan() part now only
> locks the "minimal" set of, non-prunable, relations, making a note of
> whether the plan contains any prunable subnodes and thus prunable
> relations whose locking is deferred to the caller, GetCachedPlan().
> GetCachedPlan(), as a sort of controller as mentioned before, does the
> pruning if needed on the minimally valid plan returned by
> GetCachedPlanInternal(), locks the partitions that survive, and redoes
> the whole thing if the locking of partitions invalidates the plan.

After sleeping on it, I realized this doesn't have to be that
complicated. Rather than turn GetCachedPlan() into a wrapper for
handling deferred partition locking as outlined above, I could have
changed it more simply as follows to get the same thing done:

 	if (!customplan)
 	{
-		if (CheckCachedPlan(plansource))
+		bool		hasUnlockedParts = false;
+
+		if (CheckCachedPlan(plansource, &hasUnlockedParts) &&
+			hasUnlockedParts &&
+			CachedPlanLockPartitions(plansource, boundParams, owner, extra))
 		{
 			/* We want a generic plan, and we already have a valid one */
 			plan = plansource->gplan;

Attached updated patch does it like that.

-- Thanks, Amit Langote EDB: http://www.enterprisedb.com
This version of the patch looks not entirely unreasonable to me. I'll set this as Ready for Committer in case David or Tom or someone else want to have a look and potentially commit it. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
On Wed, Dec 21, 2022 at 7:18 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > This version of the patch looks not entirely unreasonable to me. I'll > set this as Ready for Committer in case David or Tom or someone else > want to have a look and potentially commit it. Thank you, Alvaro. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > This version of the patch looks not entirely unreasonable to me. I'll > set this as Ready for Committer in case David or Tom or someone else > want to have a look and potentially commit it. I will have a look during the January CF. regards, tom lane
I spent some time re-reading this whole thread, and the more I read the
less happy I got. We are adding a lot of complexity and introducing
coding hazards that will surely bite somebody someday. And after a while
I had what felt like an epiphany: the whole problem arises because the
system is wrongly factored.

We should get rid of AcquireExecutorLocks altogether, allowing the
plancache to hand back a generic plan that it's not certain of the
validity of, and instead integrate the responsibility for acquiring
locks into executor startup. It'd have to be optional there, since we
don't need new locks in the case of executing a just-planned plan; but
we can easily add another eflags bit (EXEC_FLAG_GET_LOCKS or so). Then
there has to be a convention whereby the ExecInitNode traversal can
return an indicator that "we failed because the plan is stale, please
make a new plan".

There are a couple of reasons why this feels like a good idea:

* There's no need to worry about keeping the locking decisions in sync
with what executor startup does.

* We don't need to add the overhead proposed in the current patch to
pass forward data about what got locked/pruned. While that overhead is
hopefully less expensive than the locks it saved acquiring, it's still
overhead (and in some cases the patch will fail to save acquiring any
locks, making it certainly a net negative).

* In a successfully built execution state tree, there will simply not be
any nodes corresponding to pruned-away, never-locked subplans. As long
as code like EXPLAIN follows the state tree and doesn't poke into plan
nodes that have no matching state, it's secure against the sort of
problems that Robert worried about upthread.

While I've not attempted to write any code for this, I can also think of
a few issues that'd have to be resolved:

* We'd be pushing the responsibility for looping back and re-planning
out to fairly high-level calling code. There are only half a dozen
callers of GetCachedPlan, so there's not that many places to be touched;
but in some of those places the subsequent executor-start call is not
close by, so that the necessary refactoring might be pretty painful. I
doubt there's anything insurmountable, but we'd definitely be changing
some fundamental APIs.

* In some cases (views, at least) we need to acquire lock on relations
that aren't directly reflected anywhere in the plan tree. So there'd
have to be a separate mechanism for getting those locks and rechecking
validity afterward. A list of relevant relation OIDs might be enough for
that.

* We currently do ExecCheckPermissions() before initializing the plan
state tree. It won't do to check permissions on relations we haven't yet
locked, so that responsibility would have to be moved. Maybe that could
also be integrated into the initialization recursion? Not sure.

* In the existing usage of AcquireExecutorLocks, if we do decide that
the plan is stale then we are able to release all the locks we got
before we go off and replan. I'm not certain if that behavior needs to
be preserved, but if it does then that would require some additional
bookkeeping in the executor.

* This approach is optimizing on the assumption that we usually won't
need to replan, because if we do then we might waste a fair amount of
executor startup overhead before discovering we have to throw all that
state away. I think that's clearly the right way to bet, but perhaps
somebody else has a different view.

Thoughts?

regards, tom lane
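To visualize the refactoring being proposed, the caller-side control
flow might look roughly like the sketch below. EXEC_FLAG_GET_LOCKS, the
bool-returning ExecutorStart() variant, and the QueryDesc-building
helper are all stand-ins invented for this illustration, not existing
APIs:

    #include "postgres.h"
    #include "executor/execdesc.h"
    #include "executor/executor.h"
    #include "utils/plancache.h"

    #define EXEC_FLAG_GET_LOCKS		0x0400	/* proposed bit; value illustrative */

    /* Hypothetical stand-ins for existing setup/startup code. */
    extern QueryDesc *BuildQueryDescForCachedPlan(CachedPlan *plan,
                                                  ParamListInfo params);
    extern bool ExecutorStartCheckingValidity(QueryDesc *queryDesc, int eflags);

    /*
     * Under the proposal, executor startup acquires the locks and reports
     * whether the cached plan turned out to be stale; if so, release it
     * and plan again.
     */
    QueryDesc *
    StartCachedPlan(CachedPlanSource *plansource, ParamListInfo params,
                    ResourceOwner owner, QueryEnvironment *queryEnv)
    {
        for (;;)
        {
            CachedPlan *plan = GetCachedPlan(plansource, params, owner, queryEnv);
            QueryDesc  *queryDesc = BuildQueryDescForCachedPlan(plan, params);

            if (ExecutorStartCheckingValidity(queryDesc, EXEC_FLAG_GET_LOCKS))
                return queryDesc;       /* locks taken; plan still valid */

            /* Plan went stale while locking; clean up and replan. */
            FreeQueryDesc(queryDesc);
            ReleaseCachedPlan(plan, owner);
        }
    }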
On Fri, Jan 20, 2023 at 4:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I spent some time re-reading this whole thread, and the more I read > the less happy I got. Thanks a lot for your time on this. > We are adding a lot of complexity and introducing > coding hazards that will surely bite somebody someday. And after awhile > I had what felt like an epiphany: the whole problem arises because the > system is wrongly factored. We should get rid of AcquireExecutorLocks > altogether, allowing the plancache to hand back a generic plan that > it's not certain of the validity of, and instead integrate the > responsibility for acquiring locks into executor startup. It'd have > to be optional there, since we don't need new locks in the case of > executing a just-planned plan; but we can easily add another eflags > bit (EXEC_FLAG_GET_LOCKS or so). Then there has to be a convention > whereby the ExecInitNode traversal can return an indicator that > "we failed because the plan is stale, please make a new plan". Interesting. The current implementation relies on PlanCacheRelCallback() marking a generic CachedPlan as invalid, so perhaps there will have to be some sharing of state between the plancache and the executor for this to work? > There are a couple reasons why this feels like a good idea: > > * There's no need for worry about keeping the locking decisions in sync > with what executor startup does. > > * We don't need to add the overhead proposed in the current patch to > pass forward data about what got locked/pruned. While that overhead > is hopefully less expensive than the locks it saved acquiring, it's > still overhead (and in some cases the patch will fail to save acquiring > any locks, making it certainly a net negative). > > * In a successfully built execution state tree, there will simply > not be any nodes corresponding to pruned-away, never-locked subplans. > As long as code like EXPLAIN follows the state tree and doesn't poke > into plan nodes that have no matching state, it's secure against the > sort of problems that Robert worried about upthread. I think this is true with the patch as proposed too, but I was still a bit worried about what an ExecutorStart_hook may be doing with an uninitialized plan tree. Maybe we're mandating that the hook must call standard_ExecutorStart() and only work with the finished PlanState tree? > While I've not attempted to write any code for this, I can also > think of a few issues that'd have to be resolved: > > * We'd be pushing the responsibility for looping back and re-planning > out to fairly high-level calling code. There are only half a dozen > callers of GetCachedPlan, so there's not that many places to be > touched; but in some of those places the subsequent executor-start call > is not close by, so that the necessary refactoring might be pretty > painful. I doubt there's anything insurmountable, but we'd definitely > be changing some fundamental APIs. Yeah. I suppose mostly the same place that the current patch is touching to pass around the PartitionPruneResult nodes. > * In some cases (views, at least) we need to acquire lock on relations > that aren't directly reflected anywhere in the plan tree. So there'd > have to be a separate mechanism for getting those locks and rechecking > validity afterward. A list of relevant relation OIDs might be enough > for that. Hmm, a list of only the OIDs wouldn't preserve the lock mode, so maybe a list or bitmapset of the RTIs, something along the lines of PlannedStmt.minLockRelids in the patch? 
It perhaps even makes sense to make a special list in PlannedStmt for only the views? > * We currently do ExecCheckPermissions() before initializing the > plan state tree. It won't do to check permissions on relations we > haven't yet locked, so that responsibility would have to be moved. > Maybe that could also be integrated into the initialization recursion? > Not sure. Ah, I remember mentioning moving that into ExecGetRangeTableRelation() [1], but I guess that misses relations that are not referenced in the plan tree, such as views. Though maybe that's not a problem if we track views separately as mentioned above. > * In the existing usage of AcquireExecutorLocks, if we do decide > that the plan is stale then we are able to release all the locks > we got before we go off and replan. I'm not certain if that behavior > needs to be preserved, but if it does then that would require some > additional bookkeeping in the executor. I think maybe we'll want to continue to release the existing locks, because if we don't, it's possible we may keep some locks uselessly if replanning might lock a different set of relations. > * This approach is optimizing on the assumption that we usually > won't need to replan, because if we do then we might waste a fair > amount of executor startup overhead before discovering we have > to throw all that state away. I think that's clearly the right > way to bet, but perhaps somebody else has a different view. Not sure if you'd like, because it would still keep the PartitionPruneResult business, but this will be less of a problem if we do the initial pruning at the beginning of InitPlan(), followed by locking, before doing anything else. We would have initialized the QueryDesc and the EState, but only minimally. That also keeps the PartitionPruneResult business local to the executor. Would you like me to hack up a PoC or are you already on that? -- Thanks, Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/CA%2BHiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA%40mail.gmail.com
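As a small illustration of why RT indexes rather than bare OIDs are
attractive here, the separate locking step could then reuse the lock
mode already recorded in each RTE. The list argument and function name
below are hypothetical at this point (a similar PlannedStmt.viewRelations
list is proposed later in the thread):

    #include "postgres.h"
    #include "nodes/plannodes.h"
    #include "parser/parsetree.h"
    #include "storage/lmgr.h"

    /*
     * Lock relations (e.g. views) that appear in the range table but not
     * in the plan tree, identified by RT index so that rellockmode is
     * preserved.
     */
    static void
    LockUnplannedRelations(PlannedStmt *stmt, List *view_rtis)
    {
        ListCell   *lc;

        foreach(lc, view_rtis)
        {
            RangeTblEntry *rte = rt_fetch(lfirst_int(lc), stmt->rtable);

            LockRelationOid(rte->relid, rte->rellockmode);
        }
    }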
Amit Langote <amitlangote09@gmail.com> writes: > On Fri, Jan 20, 2023 at 4:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I had what felt like an epiphany: the whole problem arises because the >> system is wrongly factored. We should get rid of AcquireExecutorLocks >> altogether, allowing the plancache to hand back a generic plan that >> it's not certain of the validity of, and instead integrate the >> responsibility for acquiring locks into executor startup. > Interesting. The current implementation relies on > PlanCacheRelCallback() marking a generic CachedPlan as invalid, so > perhaps there will have to be some sharing of state between the > plancache and the executor for this to work? Yeah. Thinking a little harder, I think this would have to involve passing a CachedPlan pointer to the executor, and what the executor would do after acquiring each lock is to ask the plancache "hey, do you still think this CachedPlan entry is valid?". In the case where there's a problem, the AcceptInvalidationMessages call involved in lock acquisition would lead to a cache inval that clears the validity flag on the CachedPlan entry, and this would provide an inexpensive way to check if that happened. It might be possible to incorporate this pointer into PlannedStmt instead of passing it separately. >> * In a successfully built execution state tree, there will simply >> not be any nodes corresponding to pruned-away, never-locked subplans. > I think this is true with the patch as proposed too, but I was still a > bit worried about what an ExecutorStart_hook may be doing with an > uninitialized plan tree. Maybe we're mandating that the hook must > call standard_ExecutorStart() and only work with the finished > PlanState tree? It would certainly be incumbent on any such hook to not touch not-yet-locked parts of the plan tree. I'm not particularly concerned about that sort of requirements change, because we'd be breaking APIs all through this area in any case. >> * In some cases (views, at least) we need to acquire lock on relations >> that aren't directly reflected anywhere in the plan tree. So there'd >> have to be a separate mechanism for getting those locks and rechecking >> validity afterward. A list of relevant relation OIDs might be enough >> for that. > Hmm, a list of only the OIDs wouldn't preserve the lock mode, Good point. I wonder if we could integrate this with the RTEPermissionInfo data structure? > Would you like me to hack up a PoC or are you already on that? I'm not planning to work on this myself, I was hoping you would. regards, tom lane
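The "inexpensive way to check" could be as simple as re-reading the
validity flag that PlanCacheRelCallback() already clears on
invalidation; the wrapper function name here is hypothetical:

    #include "postgres.h"
    #include "utils/plancache.h"

    /*
     * After each lock acquisition (and the AcceptInvalidationMessages()
     * it implies), ask whether the CachedPlan is still considered valid.
     */
    static inline bool
    CachedPlanStillValid(const CachedPlan *cplan)
    {
        return cplan->is_valid;
    }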
On Fri, Jan 20, 2023 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <amitlangote09@gmail.com> writes: > > On Fri, Jan 20, 2023 at 4:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I had what felt like an epiphany: the whole problem arises because the > >> system is wrongly factored. We should get rid of AcquireExecutorLocks > >> altogether, allowing the plancache to hand back a generic plan that > >> it's not certain of the validity of, and instead integrate the > >> responsibility for acquiring locks into executor startup. > > > Interesting. The current implementation relies on > > PlanCacheRelCallback() marking a generic CachedPlan as invalid, so > > perhaps there will have to be some sharing of state between the > > plancache and the executor for this to work? > > Yeah. Thinking a little harder, I think this would have to involve > passing a CachedPlan pointer to the executor, and what the executor > would do after acquiring each lock is to ask the plancache "hey, do > you still think this CachedPlan entry is valid?". In the case where > there's a problem, the AcceptInvalidationMessages call involved in > lock acquisition would lead to a cache inval that clears the validity > flag on the CachedPlan entry, and this would provide an inexpensive > way to check if that happened. OK, thanks, this is useful. > It might be possible to incorporate this pointer into PlannedStmt > instead of passing it separately. Yeah, that would be less churn. Though, I wonder if you still hold that PlannedStmt should not be scribbled upon outside the planner as you said upthread [1]? > >> * In a successfully built execution state tree, there will simply > >> not be any nodes corresponding to pruned-away, never-locked subplans. > > > I think this is true with the patch as proposed too, but I was still a > > bit worried about what an ExecutorStart_hook may be doing with an > > uninitialized plan tree. Maybe we're mandating that the hook must > > call standard_ExecutorStart() and only work with the finished > > PlanState tree? > > It would certainly be incumbent on any such hook to not touch > not-yet-locked parts of the plan tree. I'm not particularly concerned > about that sort of requirements change, because we'd be breaking APIs > all through this area in any case. OK. Perhaps something that should be documented around ExecutorStart(). > >> * In some cases (views, at least) we need to acquire lock on relations > >> that aren't directly reflected anywhere in the plan tree. So there'd > >> have to be a separate mechanism for getting those locks and rechecking > >> validity afterward. A list of relevant relation OIDs might be enough > >> for that. > > > Hmm, a list of only the OIDs wouldn't preserve the lock mode, > > Good point. I wonder if we could integrate this with the > RTEPermissionInfo data structure? You mean adding a rellockmode field to RTEPermissionInfo? > > Would you like me to hack up a PoC or are you already on that? > > I'm not planning to work on this myself, I was hoping you would. Alright, I'll try to get something out early next week. Thanks for all the pointers. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/922566.1648784745%40sss.pgh.pa.us
Amit Langote <amitlangote09@gmail.com> writes: > On Fri, Jan 20, 2023 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It might be possible to incorporate this pointer into PlannedStmt >> instead of passing it separately. > Yeah, that would be less churn. Though, I wonder if you still hold > that PlannedStmt should not be scribbled upon outside the planner as > you said upthread [1]? Well, the whole point of that rule is that the executor can't modify a plancache entry. If the plancache itself sets a field in such an entry, that doesn't seem problematic from here. But there's other possibilities if that bothers you; QueryDesc could hold the field, for example. Also, I bet we'd want to copy it into EState for the main initialization recursion. regards, tom lane
On Fri, Jan 20, 2023 at 12:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > On Fri, Jan 20, 2023 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> It might be possible to incorporate this pointer into PlannedStmt
> >> instead of passing it separately.
>
> > Yeah, that would be less churn. Though, I wonder if you still hold
> > that PlannedStmt should not be scribbled upon outside the planner as
> > you said upthread [1]?
>
> Well, the whole point of that rule is that the executor can't modify
> a plancache entry. If the plancache itself sets a field in such an
> entry, that doesn't seem problematic from here.
>
> But there's other possibilities if that bothers you; QueryDesc
> could hold the field, for example. Also, I bet we'd want to copy
> it into EState for the main initialization recursion.

QueryDesc sounds good to me, and yes, also a copy in EState in any case.

So I started looking at the call sites of CreateQueryDesc() and stopped
to look at ExecParallelGetQueryDesc(). AFAICS, we wouldn't need to pass
the CachedPlan to a parallel worker's rerun of InitPlan(), because 1) it
doesn't make sense to call the plancache in a parallel worker, and 2)
the leader should already have taken all the locks necessary for
executing a given plan subnode that it intends to pass to a worker in
ExecInitGather(). Does that make sense?

-- Thanks, Amit Langote EDB: http://www.enterprisedb.com
On Fri, Jan 20, 2023 at 12:52 PM Amit Langote <amitlangote09@gmail.com> wrote: > Alright, I'll try to get something out early next week. Thanks for > all the pointers. Sorry for the delay. Attached is what I've come up with so far. I didn't actually go with calling the plancache on every lock taken on a relation, that is, in ExecGetRangeTableRelation(). One thing about doing it that way that I didn't quite like (or didn't see a clean enough way to code) is the need to complicate the ExecInitNode() traversal for handling the abrupt suspension of the ongoing setup of the PlanState tree. So, I decided to keep the current model of locking all the relations that need to be locked before doing anything else in InitPlan(), much as how AcquireExecutorLocks() does it. A new function called from the top of InitPlan that I've called ExecLockRelationsIfNeeded() does that locking after performing the initial pruning in the same manner as the earlier patch did. That does mean that I needed to keep all the adjustments of the pruning code that are required for such out-of-ExecInitNode() invocation of initial pruning, including those PartitionPruneResult to carry the result of that pruning for ExecInitNode()-time reuse, though they no longer need be passed through many unrelated interfaces. Anyways, here's a description of the patches: 0001 adjusts various call sites of ExecutorStart() to cope with the possibility of being asked to recreate a CachedPlan, if one is involved. The main objective here is to have as little stuff as sensible happen between GetCachedPlan() that returned the CachedPlan and ExecutorStart() so as to minimize the chances of missing cleaning up resources that must not be missed. 0002 is preparatory refactoring to make out-of-ExecInitNode() invocation of pruning possible. 0003 moves the responsibility of CachedPlan validation locking into ExecutorStart() as described above. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Fri, Jan 27, 2023 at 4:01 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Fri, Jan 20, 2023 at 12:52 PM Amit Langote <amitlangote09@gmail.com> wrote: > > Alright, I'll try to get something out early next week. Thanks for > > all the pointers. > > Sorry for the delay. Attached is what I've come up with so far. > > I didn't actually go with calling the plancache on every lock taken on > a relation, that is, in ExecGetRangeTableRelation(). One thing about > doing it that way that I didn't quite like (or didn't see a clean > enough way to code) is the need to complicate the ExecInitNode() > traversal for handling the abrupt suspension of the ongoing setup of > the PlanState tree. OK, I gave this one more try and attached is what I came up with. This adds an ExecPlanStillValid(), which is called right after anything that may in turn call ExecGetRangeTableRelation(), which has been taught to lock a relation if EXEC_FLAG_GET_LOCKS has been passed in EState.es_top_eflags. That includes all ExecInitNode() calls, and a few other functions that call ExecGetRangeTableRelation() directly, such as ExecOpenScanRelation(). If ExecPlanStillValid() returns false, that is, if EState.es_cachedplan is found to have been invalidated after a lock was taken by ExecGetRangeTableRelation(), whatever function called it must return immediately, and so must its caller and so on. ExecEndPlan() seems to be able to clean up after a partially finished attempt at initializing a PlanState tree in this way. Maybe my preliminary testing didn't catch cases where pointers to resources that are normally put into the nodes of a PlanState tree are now left dangling, because a partially built PlanState tree is not accessible to ExecEndPlan(); QueryDesc.planstate would remain NULL in such cases. Maybe it's only es_tupleTable and es_relations that need to be explicitly released and the rest is taken care of by resetting the ExecutorState context. On testing, I'm afraid we're going to need something like src/test/modules/delay_execution to test that concurrent changes to relation(s) in PlannedStmt.relationOids that occur somewhere between RevalidateCachedQuery() and InitPlan() result in the latter being aborted and that this is handled correctly. It seems like it is only the locking of partitions (that are not present in an unplanned Query and thus not protected by AcquirePlannerLocks()) that can trigger replanning of a CachedPlan, so any tests we write should involve partitions. Should this try to test as many plan shapes as possible, though, given the uncertainty around ExecEndPlan() robustness, or should manual auditing suffice to be sure that nothing's broken? On possibly needing to move permission checking to occur *after* taking locks, I realized that we don't really need to, because no relation that needs its permissions checked should be unlocked by the time we get to ExecCheckPermissions(); note we only check permissions of tables that are present in the original parse tree and RevalidateCachedQuery() should have locked those. I found a couple of exceptions to that invariant in that views sometimes appear not to be in the set of relations that RevalidateCachedQuery() locks. So, I invented PlannedStmt.viewRelations, a list of RT indexes of view RTEs that is populated in setrefs.c. ExecLockViewRelations(), called before ExecCheckPermissions(), locks those. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
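For readers following the control flow described above, the validity test might look roughly like the following; ExecPlanStillValid() and the es_cachedplan field are additions the patch would make to the executor, so this is only an illustrative sketch, not the patch's actual code:

    #include "postgres.h"
    #include "nodes/execnodes.h"
    #include "utils/plancache.h"

    /*
     * Illustrative only: report whether the CachedPlan that the plan tree
     * being initialized came from is still valid.  Locking a child table can
     * process invalidation messages that clear CachedPlan.is_valid; plans
     * that did not come from the plancache (es_cachedplan == NULL) are
     * always considered valid.
     */
    static inline bool
    ExecPlanStillValid(EState *estate)
    {
        return estate->es_cachedplan == NULL ||
               estate->es_cachedplan->is_valid;
    }

Each ExecInitNode() subroutine, and the few direct callers of ExecGetRangeTableRelation(), would bail out as soon as such a check returns false.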
On Thu, Feb 2, 2023 at 11:49 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Fri, Jan 27, 2023 at 4:01 PM Amit Langote <amitlangote09@gmail.com> wrote: > > I didn't actually go with calling the plancache on every lock taken on > > a relation, that is, in ExecGetRangeTableRelation(). One thing about > > doing it that way that I didn't quite like (or didn't see a clean > > enough way to code) is the need to complicate the ExecInitNode() > > traversal for handling the abrupt suspension of the ongoing setup of > > the PlanState tree. > > OK, I gave this one more try and attached is what I came up with. > > This adds an ExecPlanStillValid(), which is called right after anything > that may in turn call ExecGetRangeTableRelation(), which has been > taught to lock a relation if EXEC_FLAG_GET_LOCKS has been passed in > EState.es_top_eflags. That includes all ExecInitNode() calls, and a > few other functions that call ExecGetRangeTableRelation() directly, > such as ExecOpenScanRelation(). If ExecPlanStillValid() returns > false, that is, if EState.es_cachedplan is found to have been > invalidated after a lock was taken by ExecGetRangeTableRelation(), > whatever function called it must return immediately, and so must its > caller and so on. ExecEndPlan() seems to be able to clean up after a > partially finished attempt at initializing a PlanState tree in this > way. Maybe my preliminary testing didn't catch cases where pointers > to resources that are normally put into the nodes of a PlanState tree > are now left dangling, because a partially built PlanState tree is not > accessible to ExecEndPlan(); QueryDesc.planstate would remain NULL in > such cases. Maybe it's only es_tupleTable and es_relations that > need to be explicitly released and the rest is taken care of by > resetting the ExecutorState context. In the attached updated patch, I've made the functions that check ExecPlanStillValid() return NULL (if they return anything) instead of returning partially initialized structs. Those partially initialized structs were not being subsequently looked at anyway. > On testing, I'm afraid we're going to need something like > src/test/modules/delay_execution to test that concurrent changes to > relation(s) in PlannedStmt.relationOids that occur somewhere between > RevalidateCachedQuery() and InitPlan() result in the latter being > aborted and that this is handled correctly. It seems like it is only > the locking of partitions (that are not present in an unplanned Query > and thus not protected by AcquirePlannerLocks()) that can trigger > replanning of a CachedPlan, so any tests we write should involve > partitions. Should this try to test as many plan shapes as possible, > though, given the uncertainty around ExecEndPlan() robustness, or should > manual auditing suffice to be sure that nothing's broken? I've added a test case under src/test/modules/delay_execution by adding a new ExecutorStart_hook that works similarly to delay_execution_planner(). The test works by allowing a concurrent session to drop an object being referenced in a cached plan that is being initialized, while the ExecutorStart_hook waits to get an advisory lock. The concurrent drop of the referenced object is detected during ExecInitNode() and thus triggers replanning of the cached plan. I also fixed a bug in ExplainExecuteQuery() while testing, and improved some comments. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
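To make the test mechanism described above a bit more concrete, a minimal sketch of such an ExecutorStart_hook might look like the following; the names delay_execution_lock_id and delay_execution_ExecutorStart are made up for illustration and need not match what the module actually uses:

    #include "postgres.h"
    #include "executor/executor.h"
    #include "fmgr.h"
    #include "utils/fmgrprotos.h"

    static int  delay_execution_lock_id = 0;    /* 0 means don't delay */
    static ExecutorStart_hook_type prev_ExecutorStart = NULL;

    /*
     * Stall before initializing the plan, so that a concurrent session gets
     * a chance to alter an object referenced by the cached plan about to be
     * initialized; the invalidation is then noticed during ExecInitNode().
     */
    static void
    delay_execution_ExecutorStart(QueryDesc *queryDesc, int eflags)
    {
        if (delay_execution_lock_id != 0)
        {
            /* Block until the concurrent session releases the lock. */
            DirectFunctionCall1(pg_advisory_lock_int8,
                                Int64GetDatum((int64) delay_execution_lock_id));
            DirectFunctionCall1(pg_advisory_unlock_int8,
                                Int64GetDatum((int64) delay_execution_lock_id));
        }

        if (prev_ExecutorStart)
            prev_ExecutorStart(queryDesc, eflags);
        else
            standard_ExecutorStart(queryDesc, eflags);
    }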
Hi, On 2023-02-03 22:01:09 +0900, Amit Langote wrote: > I've added a test case under src/modules/delay_execution by adding a > new ExecutorStart_hook that works similarly as > delay_execution_planner(). The test works by allowing a concurrent > session to drop an object being referenced in a cached plan being > initialized while the ExecutorStart_hook waits to get an advisory > lock. The concurrent drop of the referenced object is detected during > ExecInitNode() and thus triggers replanning of the cached plan. > > I also fixed a bug in the ExplainExecuteQuery() while testing and some comments. The tests seem to frequently hang on freebsd: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3478 Greetings, Andres Freund
On Tue, Feb 7, 2023 at 23:38 Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-02-03 22:01:09 +0900, Amit Langote wrote:
> I've added a test case under src/modules/delay_execution by adding a
> new ExecutorStart_hook that works similarly as
> delay_execution_planner(). The test works by allowing a concurrent
> session to drop an object being referenced in a cached plan being
> initialized while the ExecutorStart_hook waits to get an advisory
> lock. The concurrent drop of the referenced object is detected during
> ExecInitNode() and thus triggers replanning of the cached plan.
>
> I also fixed a bug in the ExplainExecuteQuery() while testing and some comments.
The tests seem to frequently hang on freebsd:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3478
Thanks for the heads up. I’ve noticed this one too, though I couldn’t find the testrun artifacts like I could for some other failures (on other cirrus machines). Has anyone else been in a similar situation?
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
On Wed, Feb 8, 2023 at 7:31 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Tue, Feb 7, 2023 at 23:38 Andres Freund <andres@anarazel.de> wrote: >> The tests seem to frequently hang on freebsd: >> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3478 > > Thanks for the heads up. I’ve noticed this one too, though couldn’t find the testrun artifacts like I could get for someother failures (on other cirrus machines). Has anyone else been a similar situation? I think I have figured out what might be going wrong on that cfbot animal after building with the same CPPFLAGS as that animal locally. I had forgotten to update _out/_readRangeTblEntry() to account for the patch's change that a view's RTE_SUBQUERY now also preserves relkind in addition to relid and rellockmode for the locking consideration. Also, I noticed that a multi-query Portal execution with rules was failing (thanks to a regression test added in a7d71c41db) because of the snapshot used for the 2nd query onward not being updated for command ID change under patched model of multi-query Portal execution. To wit, under the patched model, all queries in the multi-query Portal case undergo ExecutorStart() before any of it is run with ExecutorRun(). The patch hadn't changed things however to update the snapshot's command ID for the 2nd query onwards, which caused the aforementioned test case to fail. This new model does however mean that the 2nd query onwards must use PushCopiedSnapshot() given the current requirement of UpdateActiveSnapshotCommandId() that the snapshot passed to it must not be referenced anywhere else. The new model basically requires that each query's QueryDesc points to its own copy of the ActiveSnapshot. That may not be a thing in favor of the patched model though. For now, I haven't been able to come up with a better alternative. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
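As a rough illustration of the snapshot handling described above for multi-query portals, the per-statement startup loop in PortalStart() could be shaped something like this; it is only a sketch of the idea (None_Receiver stands in for the portal's real destination, and the bookkeeping of the created QueryDescs is elided), not the patch's code:

    ListCell   *lc;
    bool        first = true;

    foreach(lc, portal->stmts)
    {
        PlannedStmt *pstmt = lfirst_node(PlannedStmt, lc);
        QueryDesc  *qdesc;

        if (!first)
        {
            /*
             * Each statement after the first gets a bumped command ID and
             * its own copy of the active snapshot, because
             * UpdateActiveSnapshotCommandId() requires that the snapshot
             * not be referenced anywhere else.
             */
            CommandCounterIncrement();
            PushCopiedSnapshot(GetActiveSnapshot());
            UpdateActiveSnapshotCommandId();
        }

        qdesc = CreateQueryDesc(pstmt, portal->sourceText,
                                GetActiveSnapshot(), InvalidSnapshot,
                                None_Receiver, portal->portalParams,
                                portal->queryEnv, 0);
        ExecutorStart(qdesc, 0);
        /* remember qdesc for PortalRun() to pass to ExecutorRun() later */
        first = false;
    }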
On Thu, Mar 2, 2023 at 10:52 PM Amit Langote <amitlangote09@gmail.com> wrote: > I think I have figured out what might be going wrong on that cfbot > animal after building with the same CPPFLAGS as that animal locally. > I had forgotten to update _out/_readRangeTblEntry() to account for the > patch's change that a view's RTE_SUBQUERY now also preserves relkind > in addition to relid and rellockmode for the locking consideration. > > Also, I noticed that a multi-query Portal execution with rules was > failing (thanks to a regression test added in a7d71c41db) because of > the snapshot used for the 2nd query onward not being updated for > command ID change under patched model of multi-query Portal execution. > To wit, under the patched model, all queries in the multi-query Portal > case undergo ExecutorStart() before any of it is run with > ExecutorRun(). The patch hadn't changed things however to update the > snapshot's command ID for the 2nd query onwards, which caused the > aforementioned test case to fail. > > This new model does however mean that the 2nd query onwards must use > PushCopiedSnapshot() given the current requirement of > UpdateActiveSnapshotCommandId() that the snapshot passed to it must > not be referenced anywhere else. The new model basically requires > that each query's QueryDesc points to its own copy of the > ActiveSnapshot. That may not be a thing in favor of the patched model > though. For now, I haven't been able to come up with a better > alternative. Here's a new version addressing the following 2 points. * Like views, I realized that non-leaf relations of partition trees scanned by an Append/MergeAppend would need to be locked separately, because ExecInitNode() traversal of the plan tree would not account for them. That is, they are not opened using ExecGetRangeTableRelation() or ExecOpenScanRelation(). One exception is that some (if not all) of those non-leaf relations may be referenced in PartitionPruneInfo and so locked as part of initializing the corresponding PartitionPruneState, but I decided not to complicate the code to filter out such relations from the set locked separately. To carry the set of relations to lock, the refactoring patch 0001 re-introduces the List of Bitmapset field named allpartrelids into Append/MergeAppend nodes, which we had previously removed on the grounds that those relations need not be locked separately (commits f2343653f5b, f003a7522bf). * I decided to initialize QueryDesc.planstate even in the cases where ExecInitNode() traversal is aborted in the middle on detecting CachedPlan invalidation such that it points to a partially initialized PlanState tree. My earlier thinking that each PlanState node need not be visited for resource cleanup in such cases was naive after all. To that end, I've fixed the ExecEndNode() subroutines of all Plan node types to account for potentially uninitialized fields. There are a couple of cases where I'm a bit doubtful though. In ExecEndCustomScan(), there's no indication in CustomScanState whether it's OK to call EndCustomScan() when BeginCustomScan() may not have been called. For ForeignScanState, I've assumed that ForeignScanState.fdw_state being set can be used as a marker that BeginForeignScan would have been called, though maybe that's not a solid assumption. 
I'm also attaching a new (small) patch 0003 that eliminates the loop-over-rangetable in ExecCloseRangeTableRelations() in favor of iterating over a new List field of EState named es_opened_relations, which is populated by ExecGetRangeTableRelation() with only the relations that were opened. This speeds up ExecCloseRangeTableRelations() significantly for the cases with many runtime-prunable partitions. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
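For what it's worth, the 0003 idea mentioned above boils down to something like the following loop, with es_opened_relations being the new EState field; this is just a sketch of my reading of it, not the patch text:

    #include "postgres.h"
    #include "access/table.h"
    #include "nodes/execnodes.h"

    /*
     * Sketch: close only the relations that were actually opened during
     * this execution, instead of walking the entire range table.  Locks
     * are held until transaction end, hence NoLock.
     */
    void
    ExecCloseRangeTableRelations(EState *estate)
    {
        ListCell   *l;

        foreach(l, estate->es_opened_relations)
        {
            Relation    rel = (Relation) lfirst(l);

            table_close(rel, NoLock);
        }
    }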
On Tue, Mar 14, 2023 at 7:07 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Thu, Mar 2, 2023 at 10:52 PM Amit Langote <amitlangote09@gmail.com> wrote: > > I think I have figured out what might be going wrong on that cfbot > > animal after building with the same CPPFLAGS as that animal locally. > > I had forgotten to update _out/_readRangeTblEntry() to account for the > > patch's change that a view's RTE_SUBQUERY now also preserves relkind > > in addition to relid and rellockmode for the locking consideration. > > > > Also, I noticed that a multi-query Portal execution with rules was > > failing (thanks to a regression test added in a7d71c41db) because of > > the snapshot used for the 2nd query onward not being updated for > > command ID change under patched model of multi-query Portal execution. > > To wit, under the patched model, all queries in the multi-query Portal > > case undergo ExecutorStart() before any of it is run with > > ExecutorRun(). The patch hadn't changed things however to update the > > snapshot's command ID for the 2nd query onwards, which caused the > > aforementioned test case to fail. > > > > This new model does however mean that the 2nd query onwards must use > > PushCopiedSnapshot() given the current requirement of > > UpdateActiveSnapshotCommandId() that the snapshot passed to it must > > not be referenced anywhere else. The new model basically requires > > that each query's QueryDesc points to its own copy of the > > ActiveSnapshot. That may not be a thing in favor of the patched model > > though. For now, I haven't been able to come up with a better > > alternative. > > Here's a new version addressing the following 2 points. > > * Like views, I realized that non-leaf relations of partition trees > scanned by an Append/MergeAppend would need to be locked separately, > because ExecInitNode() traversal of the plan tree would not account > for them. That is, they are not opened using > ExecGetRangeTableRelation() or ExecOpenScanRelation(). One exception > is that some (if not all) of those non-leaf relations may be > referenced in PartitionPruneInfo and so locked as part of initializing > the corresponding PartitionPruneState, but I decided not to complicate > the code to filter out such relations from the set locked separately. > To carry the set of relations to lock, the refactoring patch 0001 > re-introduces the List of Bitmapset field named allpartrelids into > Append/MergeAppend nodes, which we had previously removed on the > grounds that those relations need not be locked separately (commits > f2343653f5b, f003a7522bf). > > * I decided to initialize QueryDesc.planstate even in the cases where > ExecInitNode() traversal is aborted in the middle on detecting > CachedPlan invalidation such that it points to a partially initialized > PlanState tree. My earlier thinking that each PlanState node need not > be visited for resource cleanup in such cases was naive after all. To > that end, I've fixed the ExecEndNode() subroutines of all Plan node > types to account for potentially uninitialized fields. There are a > couple of cases where I'm a bit doubtful though. In > ExecEndCustomScan(), there's no indication in CustomScanState whether > it's OK to call EndCustomScan() when BeginCustomScan() may not have > been called. For ForeignScanState, I've assumed that > ForeignScanState.fdw_state being set can be used as a marker that > BeginForeignScan would have been called, though maybe that's not a > solid assumption. 
> > I'm also attaching a new (small) patch 0003 that eliminates the > loop-over-rangetable in ExecCloseRangeTableRelations() in favor of > iterating over a new List field of EState named es_opened_relations, > which is populated by ExecGetRangeTableRelation() with only the > relations that were opened. This speeds up > ExecCloseRangeTableRelations() significantly for the cases with many > runtime-prunable partitions. Here's another version with some cosmetic changes, like fixing some factually incorrect / obsolete comments and typos that I found. I also noticed that I had missed noting near some table_open() calls that locks taken with those can't possibly invalidate a plan (such as lazily opened partition routing target partitions) and thus don't need the treatment that locking during execution initialization requires. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Wed, Mar 22, 2023 at 9:48 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Tue, Mar 14, 2023 at 7:07 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Thu, Mar 2, 2023 at 10:52 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > I think I have figured out what might be going wrong on that cfbot > > > animal after building with the same CPPFLAGS as that animal locally. > > > I had forgotten to update _out/_readRangeTblEntry() to account for the > > > patch's change that a view's RTE_SUBQUERY now also preserves relkind > > > in addition to relid and rellockmode for the locking consideration. > > > > > > Also, I noticed that a multi-query Portal execution with rules was > > > failing (thanks to a regression test added in a7d71c41db) because of > > > the snapshot used for the 2nd query onward not being updated for > > > command ID change under patched model of multi-query Portal execution. > > > To wit, under the patched model, all queries in the multi-query Portal > > > case undergo ExecutorStart() before any of it is run with > > > ExecutorRun(). The patch hadn't changed things however to update the > > > snapshot's command ID for the 2nd query onwards, which caused the > > > aforementioned test case to fail. > > > > > > This new model does however mean that the 2nd query onwards must use > > > PushCopiedSnapshot() given the current requirement of > > > UpdateActiveSnapshotCommandId() that the snapshot passed to it must > > > not be referenced anywhere else. The new model basically requires > > > that each query's QueryDesc points to its own copy of the > > > ActiveSnapshot. That may not be a thing in favor of the patched model > > > though. For now, I haven't been able to come up with a better > > > alternative. > > > > Here's a new version addressing the following 2 points. > > > > * Like views, I realized that non-leaf relations of partition trees > > scanned by an Append/MergeAppend would need to be locked separately, > > because ExecInitNode() traversal of the plan tree would not account > > for them. That is, they are not opened using > > ExecGetRangeTableRelation() or ExecOpenScanRelation(). One exception > > is that some (if not all) of those non-leaf relations may be > > referenced in PartitionPruneInfo and so locked as part of initializing > > the corresponding PartitionPruneState, but I decided not to complicate > > the code to filter out such relations from the set locked separately. > > To carry the set of relations to lock, the refactoring patch 0001 > > re-introduces the List of Bitmapset field named allpartrelids into > > Append/MergeAppend nodes, which we had previously removed on the > > grounds that those relations need not be locked separately (commits > > f2343653f5b, f003a7522bf). > > > > * I decided to initialize QueryDesc.planstate even in the cases where > > ExecInitNode() traversal is aborted in the middle on detecting > > CachedPlan invalidation such that it points to a partially initialized > > PlanState tree. My earlier thinking that each PlanState node need not > > be visited for resource cleanup in such cases was naive after all. To > > that end, I've fixed the ExecEndNode() subroutines of all Plan node > > types to account for potentially uninitialized fields. There are a > > couple of cases where I'm a bit doubtful though. In > > ExecEndCustomScan(), there's no indication in CustomScanState whether > > it's OK to call EndCustomScan() when BeginCustomScan() may not have > > been called. 
For ForeignScanState, I've assumed that > > ForeignScanState.fdw_state being set can be used as a marker that > > BeginForeignScan would have been called, though maybe that's not a > > solid assumption. > > > > I'm also attaching a new (small) patch 0003 that eliminates the > > loop-over-rangetable in ExecCloseRangeTableRelations() in favor of > > iterating over a new List field of EState named es_opened_relations, > > which is populated by ExecGetRangeTableRelation() with only the > > relations that were opened. This speeds up > > ExecCloseRangeTableRelations() significantly for the cases with many > > runtime-prunable partitions. > > Here's another version with some cosmetic changes, like fixing some > factually incorrect / obsolete comments and typos that I found. I > also noticed that I had missed noting near some table_open() that > locks taken with those can't possibly invalidate a plan (such as > lazily opened partition routing target partitions) and thus need the > treatment that locking during execution initialization requires. Rebased over 3c05284d83b2 ("Invent GENERIC_PLAN option for EXPLAIN."). -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
> > On Tue, Mar 14, 2023 at 7:07 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > * I decided to initialize QueryDesc.planstate even in the cases where > > > ExecInitNode() traversal is aborted in the middle on detecting > > > CachedPlan invalidation such that it points to a partially initialized > > > PlanState tree. My earlier thinking that each PlanState node need not > > > be visited for resource cleanup in such cases was naive after all. To > > > that end, I've fixed the ExecEndNode() subroutines of all Plan node > > > types to account for potentially uninitialized fields. There are a > > > couple of cases where I'm a bit doubtful though. In > > > ExecEndCustomScan(), there's no indication in CustomScanState whether > > > it's OK to call EndCustomScan() when BeginCustomScan() may not have > > > been called. For ForeignScanState, I've assumed that > > > ForeignScanState.fdw_state being set can be used as a marker that > > > BeginForeignScan would have been called, though maybe that's not a > > > solid assumption. It seems I hadn't noted in the ExecEndNode()'s comment that all node types' recursive subroutines need to handle the change made by this patch that the corresponding ExecInitNode() subroutine may now return early without having initialized all state struct fields. Also noted in the documentation for CustomScan and ForeignScan that the Begin*Scan callback may not have been called at all, so the End*Scan should handle that gracefully. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
Amit Langote <amitlangote09@gmail.com> writes: > [ v38 patchset ] I spent a little bit of time looking through this, and concluded that it's not something I will be wanting to push into v16 at this stage. The patch doesn't seem very close to being committable on its own terms, and even if it was now is not a great time in the dev cycle to be making significant executor API changes. Too much risk of having to thrash the API during beta, or even change it some more in v17. I suggest that we push this forward to the next CF with the hope of landing it early in v17. A few concrete thoughts: * I understand that your plan now is to acquire locks on all the originally-named tables, then do permissions checks (which will involve only those tables), then dynamically lock just inheritance and partitioning child tables as we descend the plan tree. That seems more or less okay to me, but it could be reflected better in the structure of the patch perhaps. * In particular I don't much like the "viewRelations" list, which seems like a wart; those ought to be handled more nearly the same way as other RTEs. (One concrete reason why is that this scheme is going to result in locking views in a different order than they were locked during original parsing, which perhaps could contribute to deadlocks.) Maybe we should store an integer list of which RTIs need to be locked in the early phase? Building that in the parser/rewriter would provide a solid guide to the original locking order, so we'd be trivially sure of duplicating that. (It might be close enough to follow the RT list order, which is basically what AcquireExecutorLocks does today, but this'd be more certain to do the right thing.) I'm less concerned about lock order for child tables because those are just going to follow the inheritance or partitioning structure. * I don't understand the need for changes like this: /* clean up tuple table */ - ExecClearTuple(node->ps.ps_ResultTupleSlot); + if (node->ps.ps_ResultTupleSlot) + ExecClearTuple(node->ps.ps_ResultTupleSlot); ISTM that the process ought to involve taking a lock (if needed) before we have built any execution state for a given plan node, and if we find we have to fail, returning NULL instead of a partially-valid planstate node. Otherwise, considerations of how to handle partially-valid nodes are going to metastasize into all sorts of places, almost certainly including EXPLAIN for instance. I think we ought to be able to limit the damage to "parent nodes might have NULL child links that you wouldn't have expected". That wouldn't faze ExecEndNode at all, nor most other code. * More attention is needed to comments. For example, in a couple of places in plancache.c you have removed function header comments defining API details and not replaced them with any info about the new details, despite the fact that those details are more complex than the old. > It seems I hadn't noted in the ExecEndNode()'s comment that all node > types' recursive subroutines need to handle the change made by this > patch that the corresponding ExecInitNode() subroutine may now return > early without having initialized all state struct fields. > Also noted in the documentation for CustomScan and ForeignScan that > the Begin*Scan callback may not have been called at all, so the > End*Scan should handle that gracefully. Yeah, I think we need to avoid adding such requirements. It's the sort of thing that would far too easily get past developer testing and only fail once in a blue moon in the field. regards, tom lane
On Tue, Apr 4, 2023 at 6:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <amitlangote09@gmail.com> writes: > > [ v38 patchset ] > > I spent a little bit of time looking through this, and concluded that > it's not something I will be wanting to push into v16 at this stage. > The patch doesn't seem very close to being committable on its own > terms, and even if it was now is not a great time in the dev cycle > to be making significant executor API changes. Too much risk of > having to thrash the API during beta, or even change it some more > in v17. I suggest that we push this forward to the next CF with the > hope of landing it early in v17. OK, thanks a lot for your feedback. > A few concrete thoughts: > > * I understand that your plan now is to acquire locks on all the > originally-named tables, then do permissions checks (which will > involve only those tables), then dynamically lock just inheritance and > partitioning child tables as we descend the plan tree. Actually, with the current implementation of the patch, *all* of the relations mentioned in the plan tree would get locked during the ExecInitNode() traversal of the plan tree (and of those in plannedstmt->subplans), not just the inheritance child tables. Locking of non-child tables done by the executor after this patch is duplicative with AcquirePlannerLocks(), so that's something to be improved. > That seems > more or less okay to me, but it could be reflected better in the > structure of the patch perhaps. > > * In particular I don't much like the "viewRelations" list, which > seems like a wart; those ought to be handled more nearly the same way > as other RTEs. (One concrete reason why is that this scheme is going > to result in locking views in a different order than they were locked > during original parsing, which perhaps could contribute to deadlocks.) > Maybe we should store an integer list of which RTIs need to be locked > in the early phase? Building that in the parser/rewriter would provide > a solid guide to the original locking order, so we'd be trivially sure > of duplicating that. (It might be close enough to follow the RT list > order, which is basically what AcquireExecutorLocks does today, but > this'd be more certain to do the right thing.) I'm less concerned > about lock order for child tables because those are just going to > follow the inheritance or partitioning structure. What you've described here sounds somewhat like what I had implemented in the patch versions till v31, though it used a bitmapset named minLockRelids that is initialized by setrefs.c. Your idea of initializing a list before planning seems more appealing offhand than the code I had added in setrefs.c to populate that minLockRelids bitmapset, which would be bms_add_range(1, list_lenth(finalrtable)), followed by bms_del_members(set-of-child-rel-rtis). I'll give your idea a try. > * I don't understand the need for changes like this: > > /* clean up tuple table */ > - ExecClearTuple(node->ps.ps_ResultTupleSlot); > + if (node->ps.ps_ResultTupleSlot) > + ExecClearTuple(node->ps.ps_ResultTupleSlot); > > ISTM that the process ought to involve taking a lock (if needed) > before we have built any execution state for a given plan node, > and if we find we have to fail, returning NULL instead of a > partially-valid planstate node. Otherwise, considerations of how > to handle partially-valid nodes are going to metastasize into all > sorts of places, almost certainly including EXPLAIN for instance. 
> I think we ought to be able to limit the damage to "parent nodes > might have NULL child links that you wouldn't have expected". > That wouldn't faze ExecEndNode at all, nor most other code. Hmm, yes, taking a lock before allocating any of the stuff to add into the planstate seems like it's much easier to reason about than the alternative I've implemented. > * More attention is needed to comments. For example, in a couple of > places in plancache.c you have removed function header comments > defining API details and not replaced them with any info about the new > details, despite the fact that those details are more complex than the > old. OK, yeah, maybe I've added a bunch of explanations in execMain.c that should perhaps have been in plancache.c. > > It seems I hadn't noted in the ExecEndNode()'s comment that all node > > types' recursive subroutines need to handle the change made by this > > patch that the corresponding ExecInitNode() subroutine may now return > > early without having initialized all state struct fields. > > Also noted in the documentation for CustomScan and ForeignScan that > > the Begin*Scan callback may not have been called at all, so the > > End*Scan should handle that gracefully. > > Yeah, I think we need to avoid adding such requirements. It's the > sort of thing that would far too easily get past developer testing > and only fail once in a blue moon in the field. OK, got it. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
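Spelled out, the setrefs.c-time bookkeeping being compared above amounts to roughly the following; minLockRelids comes from the earlier patch versions and childrel_rtis is a made-up name for the set of child-relation RT indexes, so treat this as an illustration only:

    /* Assume every RTE must be locked by AcquireExecutorLocks() ... */
    glob->minLockRelids = bms_add_range(NULL, 1,
                                        list_length(glob->finalrtable));
    /* ... except the inheritance children added during planning. */
    glob->minLockRelids = bms_del_members(glob->minLockRelids,
                                          childrel_rtis);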
On Tue, Apr 4, 2023 at 10:29 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Apr 4, 2023 at 6:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > A few concrete thoughts:
> >
> > * I understand that your plan now is to acquire locks on all the
> > originally-named tables, then do permissions checks (which will
> > involve only those tables), then dynamically lock just inheritance and
> > partitioning child tables as we descend the plan tree.
>
> Actually, with the current implementation of the patch, *all* of the
> relations mentioned in the plan tree would get locked during the
> ExecInitNode() traversal of the plan tree (and of those in
> plannedstmt->subplans), not just the inheritance child tables.
> Locking of non-child tables done by the executor after this patch is
> duplicative with AcquirePlannerLocks(), so that's something to be
> improved.
>
> > That seems
> > more or less okay to me, but it could be reflected better in the
> > structure of the patch perhaps.
> >
> > * In particular I don't much like the "viewRelations" list, which
> > seems like a wart; those ought to be handled more nearly the same way
> > as other RTEs. (One concrete reason why is that this scheme is going
> > to result in locking views in a different order than they were locked
> > during original parsing, which perhaps could contribute to deadlocks.)
> > Maybe we should store an integer list of which RTIs need to be locked
> > in the early phase? Building that in the parser/rewriter would provide
> > a solid guide to the original locking order, so we'd be trivially sure
> > of duplicating that. (It might be close enough to follow the RT list
> > order, which is basically what AcquireExecutorLocks does today, but
> > this'd be more certain to do the right thing.) I'm less concerned
> > about lock order for child tables because those are just going to
> > follow the inheritance or partitioning structure.
>
> What you've described here sounds somewhat like what I had implemented
> in the patch versions till v31, though it used a bitmapset named
> minLockRelids that is initialized by setrefs.c. Your idea of
> initializing a list before planning seems more appealing offhand than
> the code I had added in setrefs.c to populate that minLockRelids
> bitmapset, which would be bms_add_range(1, list_lenth(finalrtable)),
> followed by bms_del_members(set-of-child-rel-rtis).
>
> I'll give your idea a try.
After sleeping on this, I think we perhaps don't need to remember the originally-named relations just for the purpose of locking them for execution. That's because, for a reused (cached) plan, AcquirePlannerLocks() would have taken those locks anyway.
AcquirePlannerLocks() doesn't lock inheritance children because they would be added to the range table by the planner, so they should be locked separately for execution, if needed. I thought taking the execution-time locks only when inside ExecInit[Merge]Append would work, but then there are cases where single-child Append/MergeAppend plans are stripped of the Append/MergeAppend node by setrefs.c. Maybe we need a place to remember such child relations, that is, only in the cases where Append/MergeAppend elision occurs, perhaps in something esoteric-sounding like PlannedStmt.elidedAppendChildRels?
Another set of child relations that are not covered by Append/MergeAppend child nodes is non-leaf partitions. I've proposed adding a List of Bitmapset field to Append/MergeAppend named 'allpartrelids' as part of this patchset (patch 0001) to track those for execution-time locking.
--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
Here is a new version. Summary of main changes since the last version that Tom reviewed back in April: * ExecInitNode() subroutines now return NULL (as opposed to a partially initialized PlanState node as in the last version) upon detecting that the CachedPlan that the plan tree is from is no longer valid due to invalidation messages processed upon taking locks. Plan tree subnodes that are fully initialized till the point of detection are added by ExecInitNode() into a List in EState called es_inited_plannodes. ExecEndPlan() now iterates over that list to close each one individually using ExecEndNode(). ExecEndNode() or its subroutines thus no longer need to be recursive to close the child nodes. Also, with this design, there is no longer the possibility of partially initialized PlanState trees with partially initialized individual PlanState nodes, so the ExecEndNode() subroutine changes that were in the last version to account for partial initialization are not necessary. * Instead of setting EXEC_FLAG_GET_LOCKS in es_top_eflags for the entire duration of InitPlan(), it is now only set in ExecInitAppend() and ExecInitMergeAppend(), because that's where the subnodes scanning child tables would be and the executor only needs to lock child tables to validate a CachedPlan in a race-free manner. Parent tables that appear in the query would have been locked by AcquirePlannerLocks(). Child tables whose scan subnodes don't appear under Append/MergeAppend (due to the latter being removed by setrefs.c for there being only a single child) are identified in PlannedStmt.elidedAppendChildRelations and InitPlan() locks each one found there if the plan tree is from a CachedPlan. * There's no longer PlannedStmt.viewRelations, because view relations need not be tracked separately for locking as AcquirePlannerLocks() covers them.
Attachment
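To visualize the convention described in the summary above, an ExecInitNode() subroutine under this scheme would contain fragments along these lines; fooState is a placeholder for any node's state struct, and ExecPlanStillValid() / es_inited_plannodes are the patch's additions, so this is abridged illustration rather than actual patch code:

    /* ... right after recursing to initialize a child subplan ... */
    outerPlanState(fooState) = ExecInitNode(outerPlan(node), estate, eflags);
    if (!ExecPlanStillValid(estate))
        return NULL;        /* CachedPlan invalidated by a lock just taken */

    /*
     * ... and at the very end, once the node is fully set up, remember it
     * for ExecEndPlan()'s non-recursive shutdown loop.
     */
    estate->es_inited_plannodes = lappend(estate->es_inited_plannodes,
                                          fooState);
    return (PlanState *) fooState;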
> On 8 Jun 2023, at 16:23, Amit Langote <amitlangote09@gmail.com> wrote: > > Here is a new version. The local planstate variable in the hunk below is shadowing the function parameter planstate which cause a compiler warning: @@ -1495,18 +1556,15 @@ ExecEndPlan(PlanState *planstate, EState *estate) ListCell *l; /* - * shut down the node-type-specific query processing - */ - ExecEndNode(planstate); - - /* - * for subplans too + * Shut down the node-type-specific query processing for all nodes that + * were initialized during InitPlan(), both in the main plan tree and those + * in subplans (es_subplanstates), if any. */ - foreach(l, estate->es_subplanstates) + foreach(l, estate->es_inited_plannodes) { - PlanState *subplanstate = (PlanState *) lfirst(l); + PlanState *planstate = (PlanState *) lfirst(l); -- Daniel Gustafsson
On Mon, Jul 3, 2023 at 10:27 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > On 8 Jun 2023, at 16:23, Amit Langote <amitlangote09@gmail.com> wrote: > > > > Here is a new version. > > The local planstate variable in the hunk below is shadowing the function > parameter planstate which cause a compiler warning: Thanks Daniel for the heads up. Attached new version fixes that and contains a few other notable changes. Before going into the details of those changes, let me reiterate in broad strokes what the patch is trying to do. The idea is to move the locking of some tables referenced in a cached (generic) plan from plancache/GetCachedPlan() to the executor/ExecutorStart(). Specifically, the locking of inheritance child tables. Why? Because partition pruning with "initial pruning steps" contained in the Append/MergeAppend nodes may eliminate some child tables that need not have been locked to begin with, though the pruning can only occur during ExecutorStart(). After applying this patch, GetCachedPlan() only locks the tables that are directly mentioned in the query to ensure that the analyzed-rewritten-but-unplanned query tree backing a given CachedPlan is still valid (cf RevalidateCachedQuery()), but not the tables in the CachedPlan that would have been added by the planner. Tables in a CachePlan that would not be locked currently only include the inheritance child tables / partitions of the tables mentioned in the query. This means that the plan trees in a given CachedPlan returned by GetCachedPlan() are only partially valid and are subject to invalidation because concurrent sessions can possibly modify the child tables referenced in them before ExecutorStart() gets around to locking them. If the concurrent modifications do happen, ExecutorStart() is now equipped to detect them by way of noticing that the CachedPlan is invalidated and inform the caller to discard and recreate the CachedPlan. This entails changing all the call sites of ExecutorStart() that pass it a plan tree from a CachedPlan to implement the replan-and-retry-execution loop. Given the above, ExecutorStart(), which has not needed so far to take any locks (except on indexes mentioned in IndexScans), now needs to lock child tables if executing a cached plan which contains them. In the previous versions, the patch used a flag passed in EState.es_top_eflags to signal ExecGetRangeTableRelation() to lock the table. The flag would be set in ExecInitAppend() and ExecInitMergeAppend() for the duration of the loop that initializes child subplans with the assumption that that's where the child tables would be opened. But not all child subplans of Append/MergeAppend scan child tables (think UNION ALL queries), so this approach can result in redundant locking. Worse, I needed to invent PlannedStmt.elidedAppendChildRelations to separately track child tables whose Scan nodes' parent Append/MergeAppend would be removed by setrefs.c in some cases. So, this new patch uses a flag in the RangeTblEntry itself to denote if the table is a child table instead of the above roundabout way. ExecGetRangeTableRelation() can simply look at the RTE to decide whether to take a lock or not. I considered adding a new bool field, but noticed we already have inFromCl to track if a given RTE is for table/entity directly mentioned in the query or for something added behind-the-scenes into the range table as the field's description in parsenodes.h says. 
RTEs for child tables are added behind-the-scenes by the planner and it makes perfect sense to me to mark their inFromCl as false. I can't find anything that relies on the current behavior of inFromCl being set to the same value as the root inheritance parent (true). Patch 0002 makes this change for child RTEs. A few other notes: * A parallel worker does ExecutorStart() without access to the CachedPlan that the leader may have gotten its plan tree from. This means that parallel workers do not have the ability to detect plan tree invalidations. I think that's fine, because if the leader was able to launch workers at all, it would also have gotten all the locks to protect the (portion of the) plan tree that the workers would be executing. I had an off-list discussion about this with Robert and he mentioned his concern that each parallel worker would have its own view of which child subplans of a parallel Append are "valid", which depends on the result of its own evaluation of initial pruning. So, there may be race conditions whereby a worker may try to execute plan nodes that are no longer valid, for example, if the partition a worker considers valid is not viewed as such by the leader and thus not locked. I shared my thoughts as to why that sounds unlikely at [1], though maybe I'm a bit too optimistic? * For multi-query portals, you can't now do ExecutorStart() immediately followed by ExecutorRun() for each query in the portal, because ExecutorStart() may now fail to start a plan if it gets invalidated. So PortalStart() now does ExecutorStart()s for all queries and remembers the QueryDescs for PortalRun() to then do the ExecutorRun()s with. A consequence of this is that CommandCounterIncrement() now must be done between the ExecutorStart()s of the individual plans in PortalStart() and not between the ExecutorRun()s in PortalRunMulti(). make check-world passes with this new arrangement, though I'm not entirely confident that there are no problems lurking. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com [1] https://postgr.es/m/CA+HiwqFA=swkzgGK8AmXUNFtLeEXFJwFyY3E7cTxvL46aa1OTw@mail.gmail.com
Attachment
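As a rough sketch of the RTE-driven locking decision described above (not the patch's literal code; es_cachedplan is the EState field the patch adds, and the real function also caches the opened relation in es_relations), ExecGetRangeTableRelation() could do something like:

    #include "postgres.h"
    #include "access/table.h"
    #include "executor/executor.h"
    #include "nodes/execnodes.h"

    Relation
    ExecGetRangeTableRelation(EState *estate, Index rti)
    {
        RangeTblEntry *rte = exec_rt_fetch(rti, estate);
        LOCKMODE    lockmode = NoLock;

        Assert(rte->rtekind == RTE_RELATION);

        /*
         * Child-table RTEs (inFromCl == false under the proposed convention)
         * are not locked by AcquirePlannerLocks(), so lock them here when
         * the plan came from the plancache; everything else was locked
         * upstream.
         */
        if (!rte->inFromCl && estate->es_cachedplan != NULL)
            lockmode = rte->rellockmode;

        return table_open(rte->relid, lockmode);
    }

After such a lock is taken, the caller would immediately consult ExecPlanStillValid() as described earlier in the thread.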
On Thu, Jul 6, 2023 at 11:29 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Mon, Jul 3, 2023 at 10:27 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > > On 8 Jun 2023, at 16:23, Amit Langote <amitlangote09@gmail.com> wrote: > > > Here is a new version. > > > > The local planstate variable in the hunk below is shadowing the function > > parameter planstate which cause a compiler warning: > > Thanks Daniel for the heads up. > > Attached new version fixes that and contains a few other notable > changes. Before going into the details of those changes, let me > reiterate in broad strokes what the patch is trying to do. > > The idea is to move the locking of some tables referenced in a cached > (generic) plan from plancache/GetCachedPlan() to the > executor/ExecutorStart(). Specifically, the locking of inheritance > child tables. Why? Because partition pruning with "initial pruning > steps" contained in the Append/MergeAppend nodes may eliminate some > child tables that need not have been locked to begin with, though the > pruning can only occur during ExecutorStart(). > > After applying this patch, GetCachedPlan() only locks the tables that > are directly mentioned in the query to ensure that the > analyzed-rewritten-but-unplanned query tree backing a given CachedPlan > is still valid (cf RevalidateCachedQuery()), but not the tables in the > CachedPlan that would have been added by the planner. Tables in a > CachePlan that would not be locked currently only include the > inheritance child tables / partitions of the tables mentioned in the > query. This means that the plan trees in a given CachedPlan returned > by GetCachedPlan() are only partially valid and are subject to > invalidation because concurrent sessions can possibly modify the child > tables referenced in them before ExecutorStart() gets around to > locking them. If the concurrent modifications do happen, > ExecutorStart() is now equipped to detect them by way of noticing that > the CachedPlan is invalidated and inform the caller to discard and > recreate the CachedPlan. This entails changing all the call sites of > ExecutorStart() that pass it a plan tree from a CachedPlan to > implement the replan-and-retry-execution loop. > > Given the above, ExecutorStart(), which has not needed so far to take > any locks (except on indexes mentioned in IndexScans), now needs to > lock child tables if executing a cached plan which contains them. In > the previous versions, the patch used a flag passed in > EState.es_top_eflags to signal ExecGetRangeTableRelation() to lock the > table. The flag would be set in ExecInitAppend() and > ExecInitMergeAppend() for the duration of the loop that initializes > child subplans with the assumption that that's where the child tables > would be opened. But not all child subplans of Append/MergeAppend > scan child tables (think UNION ALL queries), so this approach can > result in redundant locking. Worse, I needed to invent > PlannedStmt.elidedAppendChildRelations to separately track child > tables whose Scan nodes' parent Append/MergeAppend would be removed by > setrefs.c in some cases. > > So, this new patch uses a flag in the RangeTblEntry itself to denote > if the table is a child table instead of the above roundabout way. > ExecGetRangeTableRelation() can simply look at the RTE to decide > whether to take a lock or not. 
I considered adding a new bool field, > but noticed we already have inFromCl to track if a given RTE is for > table/entity directly mentioned in the query or for something added > behind-the-scenes into the range table as the field's description in > parsenodes.h says. RTEs for child tables are added behind-the-scenes > by the planner and it makes perfect sense to me to mark their inFromCl > as false. I can't find anything that relies on the current behavior > of inFromCl being set to the same value as the root inheritance parent > (true). Patch 0002 makes this change for child RTEs. > > A few other notes: > > * A parallel worker does ExecutorStart() without access to the > CachedPlan that the leader may have gotten its plan tree from. This > means that parallel workers do not have the ability to detect plan > tree invalidations. I think that's fine, because if the leader would > have been able to launch workers at all, it would also have gotten all > the locks to protect the (portion of) the plan tree that the workers > would be executing. I had an off-list discussion about this with > Robert and he mentioned his concern that each parallel worker would > have its own view of which child subplans of a parallel Append are > "valid" that depends on the result of its own evaluation of initial > pruning. So, there may be race conditions whereby a worker may try > to execute plan nodes that are no longer valid, for example, if the > partition a worker considers valid is not viewed as such by the leader > and thus not locked. I shared my thoughts as to why that sounds > unlikely at [1], though maybe I'm a bit too optimistic? > > * For multi-query portals, you can't now do ExecutorStart() > immediately followed by ExecutorRun() for each query in the portal, > because ExecutorStart() may now fail to start a plan if it gets > invalidated. So PortalStart() now does ExecutorStart()s for all > queries and remembers the QueryDescs for PortalRun() then to do > ExecutorRun()s using. A consequence of this is that > CommandCounterIncrement() now must be done between the > ExecutorStart()s of the individual plans in PortalStart() and not > between the ExecutorRun()s in PortalRunMulti(). make check-world > passes with this new arrangement, though I'm not entirely confident > that there are no problems lurking. In an absolutely brown-paper-bag moment, I realized that I had not updated src/backend/executor/README to reflect the changes to the executor's control flow that this patch makes. That is, after scrapping the old design back in January whose details *were* reflected in the patches before that redesign. Anyway, the attached fixes that. Tom, do you think you have bandwidth in the near future to give this another look? I think I've addressed the comments that you had given back in April, though as mentioned in the previous message, there may still be some funny-looking aspects still remaining. In any case, I have no intention of pressing ahead with the patch without another committer having had a chance to sign off on it. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Thu, 13 Jul 2023 at 13:59, Amit Langote <amitlangote09@gmail.com> wrote: > In an absolutely brown-paper-bag moment, I realized that I had not > updated src/backend/executor/README to reflect the changes to the > executor's control flow that this patch makes. That is, after > scrapping the old design back in January whose details *were* > reflected in the patches before that redesign. > > Anyway, the attached fixes that. > > Tom, do you think you have bandwidth in the near future to give this > another look? I think I've addressed the comments that you had given > back in April, though as mentioned in the previous message, there may > still be some funny-looking aspects still remaining. In any case, I > have no intention of pressing ahead with the patch without another > committer having had a chance to sign off on it.

I've only just started taking a look at this, and my first test drive yields very impressive results:

8192 partitions (3 runs, 10000 rows)
Head     391.294989   382.622481   379.252236
Patched  13088.145995 13406.135531 13431.828051

Looking at your changes to README, I would like to suggest rewording the following:

+table during planning. This means that inheritance child tables, which are
+added to the query's range table during planning, if they are present in a
+cached plan tree would not have been locked.

To:

This means that inheritance child tables present in a cached plan tree, which are added to the query's range table during planning, would not have been locked.

Also, further down:

s/intiatialize/initialize/

I'll carry on taking a closer look and see if I can break it.

Thom
Hi Thom, On Tue, Jul 18, 2023 at 1:33 AM Thom Brown <thom@linux.com> wrote: > On Thu, 13 Jul 2023 at 13:59, Amit Langote <amitlangote09@gmail.com> wrote: > > In an absolutely brown-paper-bag moment, I realized that I had not > > updated src/backend/executor/README to reflect the changes to the > > executor's control flow that this patch makes. That is, after > > scrapping the old design back in January whose details *were* > > reflected in the patches before that redesign. > > > > Anyway, the attached fixes that. > > > > Tom, do you think you have bandwidth in the near future to give this > > another look? I think I've addressed the comments that you had given > > back in April, though as mentioned in the previous message, there may > > still be some funny-looking aspects still remaining. In any case, I > > have no intention of pressing ahead with the patch without another > > committer having had a chance to sign off on it. > > I've only just started taking a look at this, and my first test drive > yields very impressive results: > > 8192 partitions (3 runs, 10000 rows) > Head 391.294989 382.622481 379.252236 > Patched 13088.145995 13406.135531 13431.828051 Just to be sure, did you use pgbench --Mprepared with plan_cache_mode = force_generic_plan in postgresql.conf? > Looking at your changes to README, I would like to suggest rewording > the following: > > +table during planning. This means that inheritance child tables, which are > +added to the query's range table during planning, if they are present in a > +cached plan tree would not have been locked. > > To: > > This means that inheritance child tables present in a cached plan > tree, which are added to the query's range table during planning, > would not have been locked. > > Also, further down: > > s/intiatialize/initialize/ > > I'll carry on taking a closer look and see if I can break it. Thanks for looking. I've fixed these issues in the attached updated patch. I've also changed the position of a newly added paragraph in src/backend/executor/README so that it doesn't break the flow of the existing text. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Tue, 18 Jul 2023, 08:26 Amit Langote, <amitlangote09@gmail.com> wrote:
Hi Thom,
On Tue, Jul 18, 2023 at 1:33 AM Thom Brown <thom@linux.com> wrote:
> On Thu, 13 Jul 2023 at 13:59, Amit Langote <amitlangote09@gmail.com> wrote:
> > In an absolutely brown-paper-bag moment, I realized that I had not
> > updated src/backend/executor/README to reflect the changes to the
> > executor's control flow that this patch makes. That is, after
> > scrapping the old design back in January whose details *were*
> > reflected in the patches before that redesign.
> >
> > Anyway, the attached fixes that.
> >
> > Tom, do you think you have bandwidth in the near future to give this
> > another look? I think I've addressed the comments that you had given
> > back in April, though as mentioned in the previous message, there may
> > still be some funny-looking aspects still remaining. In any case, I
> > have no intention of pressing ahead with the patch without another
> > committer having had a chance to sign off on it.
>
> I've only just started taking a look at this, and my first test drive
> yields very impressive results:
>
> 8192 partitions (3 runs, 10000 rows)
> Head 391.294989 382.622481 379.252236
> Patched 13088.145995 13406.135531 13431.828051
Just to be sure, did you use pgbench --Mprepared with plan_cache_mode
= force_generic_plan in postgresql.conf?
I did.
For full disclosure, I also had max_locks_per_transaction set to 10000.
> Looking at your changes to README, I would like to suggest rewording
> the following:
>
> +table during planning. This means that inheritance child tables, which are
> +added to the query's range table during planning, if they are present in a
> +cached plan tree would not have been locked.
>
> To:
>
> This means that inheritance child tables present in a cached plan
> tree, which are added to the query's range table during planning,
> would not have been locked.
>
> Also, further down:
>
> s/intiatialize/initialize/
>
> I'll carry on taking a closer look and see if I can break it.
Thanks for looking. I've fixed these issues in the attached updated
patch. I've also changed the position of a newly added paragraph in
src/backend/executor/README so that it doesn't break the flow of the
existing text.
Thanks.
Thom
While chatting with Robert about this patch set, he suggested that it would be better to break out some executor refactoring changes from the main patch (0003) into a separate patch. To wit, the changes to make the PlanState tree cleanup in ExecEndPlan() non-recursive by walking a flat list of PlanState nodes instead of the recursive tree walk that ExecEndNode() currently does. That allows us to cleanly handle the cases where the PlanState tree is only partially constructed when ExecInitNode() detects in the middle of its construction that the plan tree is no longer valid after receiving and processing an invalidation message on locking child tables. Or at least more cleanly than the previously proposed approach of adjusting ExecEndNode() subroutines for the individual node types to gracefully handle such partially initialized PlanState trees. With the new approach, node type specific subroutines of ExecEndNode() need not close their child nodes, because ExecEndPlan() would directly close each node that has been initialized. I couldn't find any instance of breakage caused by this decoupling of child node cleanup from their parent node's cleanup. Comments in ExecEndGather() and ExecEndGatherMerge() appear to suggest that outerPlan must be closed before the local cleanup:

 void
 ExecEndGather(GatherState *node)
 {
-    ExecEndNode(outerPlanState(node));  /* let children clean up first */
+    /* outerPlan is closed separately. */
     ExecShutdownGather(node);
     ExecFreeExprContext(&node->ps);

But I don't think there's a problem, because what ExecShutdownGather() does seems entirely independent of cleanup of outerPlan.

As for the performance impact of initializing the list of initialized nodes to use during the cleanup phase, I couldn't find a regression, nor any improvement from replacing the tree walk with a linear scan of a list. Actually, ExecEndNode() is pretty far down in the perf profile anyway, so the performance difference caused by the patch hardly matters. See the following contrived example:

create table f();
analyze f;

explain (costs off) select count(*) from f f1, f f2, f f3, f f4, f f5, f f6, f f7, f f8, f f9, f f10;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate
   ->  Nested Loop
         ->  Nested Loop
               ->  Nested Loop
                     ->  Nested Loop
                           ->  Nested Loop
                                 ->  Nested Loop
                                       ->  Nested Loop
                                             ->  Nested Loop
                                                   ->  Nested Loop
                                                         ->  Seq Scan on f f1
                                                         ->  Seq Scan on f f2
                                                   ->  Seq Scan on f f3
                                             ->  Seq Scan on f f4
                                       ->  Seq Scan on f f5
                                 ->  Seq Scan on f f6
                           ->  Seq Scan on f f7
                     ->  Seq Scan on f f8
               ->  Seq Scan on f f9
         ->  Seq Scan on f f10
(20 rows)

do $$
begin
  for i in 1..100000 loop
    perform count(*) from f f1, f f2, f f3, f f4, f f5, f f6, f f7, f f8, f f9, f f10;
  end loop;
end; $$;

Times for the DO:

Unpatched:
Time: 756.353 ms
Time: 745.752 ms
Time: 749.184 ms

Patched:
Time: 737.717 ms
Time: 747.815 ms
Time: 753.456 ms

I've attached the new refactoring patch as 0001.

Another change I've made in the main patch is to change the API of ExecutorStart() (and ExecutorStart_hook) to more explicitly return a boolean indicating whether or not the plan initialization was successful. That way seems better than making the callers figure that out by seeing that QueryDesc.planstate is NULL and/or checking QueryDesc.plan_valid. Correspondingly, PortalStart() now also returns true or false matching what ExecutorStart() returned. I suppose this better alerts any extensions that use the ExecutorStart_hook to fix their code to do the right thing. 
Having extracted the ExecEndNode() change, I'm also starting to feel inclined to extract a couple of other bits from the main patch as separate patches, such as moving the ExecutorStart() call from PortalRun() to PortalStart() for the multi-query portals. I'll do that in the next version.
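For readers following along, the flat-list arrangement described above boils down to fragments along these lines; the field and variable names are illustrative rather than taken from the patch:

/* In ExecInitNode(), once "result" has been fully initialized: */
estate->es_planstate_nodes = lappend(estate->es_planstate_nodes, result);

/* In ExecEndPlan(), instead of one recursive ExecEndNode() call on the root: */
ListCell   *lc;

foreach(lc, estate->es_planstate_nodes)
{
    PlanState  *ps = (PlanState *) lfirst(lc);

    ExecEndNode(ps);    /* per-node routines no longer recurse to children */
}

Assuming each node is appended only after its children have been initialized, walking the list in order also ends children before their parents, which is the ordering the Gather nodes rely on.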
Attachment
- v43-0002-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v43-0001-Make-PlanState-tree-cleanup-non-recursive.patch
- v43-0005-Track-opened-range-table-relations-in-a-List-in-.patch
- v43-0003-Set-inFromCl-to-false-in-child-table-RTEs.patch
- v43-0004-Delay-locking-of-child-tables-in-cached-plans-un.patch
On Wed, Aug 2, 2023 at 10:39 PM Amit Langote <amitlangote09@gmail.com> wrote: > Having extracted the ExecEndNode() change, I'm also starting to feel > inclined to extract a couple of other bits from the main patch as > separate patches, such as moving the ExecutorStart() call from > PortalRun() to PortalStart() for the multi-query portals. I'll do > that in the next version. Here's a patch set where the refactoring to move the ExecutorStart() calls to be closer to GetCachedPlan() (for the call sites that use a CachedPlan) is extracted into a separate patch, 0002. Its commit message notes an aspect of this refactoring that I feel a bit nervous about -- needing to also move the CommandCounterIncrement() call from the loop in PortalRunMulti() to PortalStart() which now does ExecutorStart() for the PORTAL_MULTI_QUERY case. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
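To make the refactoring concrete, the replan-and-retry loop at such a call site might look roughly like the following sketch. The surrounding variables (plansource, params, query_string, dest, queryEnv) are assumed context, the bool-returning ExecutorStart() is the patched API rather than the current one, and the cleanup of a failed start is glossed over:

QueryDesc  *qdesc;

for (;;)
{
    CachedPlan *cplan = GetCachedPlan(plansource, params, NULL, queryEnv);
    PlannedStmt *pstmt = linitial_node(PlannedStmt, cplan->stmt_list);

    qdesc = CreateQueryDesc(pstmt, query_string,
                            GetActiveSnapshot(), InvalidSnapshot,
                            dest, params, queryEnv, 0);

    /* Patched API: returns false if locking partitions invalidated the plan. */
    if (ExecutorStart(qdesc, 0))
        break;                  /* plan is fully locked and still valid */

    /* Discard the stale plan and build a fresh one on the next iteration. */
    FreeQueryDesc(qdesc);
    ReleaseCachedPlan(cplan, NULL);
}

/* ... then ExecutorRun(), ExecutorFinish(), ExecutorEnd() as usual ... */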
Attachment
- v44-0004-Set-inFromCl-to-false-in-child-table-RTEs.patch
- v44-0006-Track-opened-range-table-relations-in-a-List-in-.patch
- v44-0002-Refactoring-to-move-ExecutorStart-calls-to-be-ne.patch
- v44-0003-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v44-0005-Delay-locking-of-child-tables-in-cached-plans-un.patch
- v44-0001-Make-PlanState-tree-cleanup-non-recursive.patch
On Thu, Aug 3, 2023 at 4:37 AM Amit Langote <amitlangote09@gmail.com> wrote: > Here's a patch set where the refactoring to move the ExecutorStart() > calls to be closer to GetCachedPlan() (for the call sites that use a > CachedPlan) is extracted into a separate patch, 0002. Its commit > message notes an aspect of this refactoring that I feel a bit nervous > about -- needing to also move the CommandCounterIncrement() call from > the loop in PortalRunMulti() to PortalStart() which now does > ExecutorStart() for the PORTAL_MULTI_QUERY case. I spent some time today reviewing 0001. Here are a few thoughts and notes about things that I looked at. First, I wondered whether it was really adequate for ExecEndPlan() to just loop over estate->es_plan_nodes and call it good. Put differently, is it possible that we could ever have more than one relevant EState, say for a subplan or an EPQ execution or something, so that this loop wouldn't cover everything? I found nothing to make me think that this is a real danger. Second, I wondered whether the ordering of cleanup operations could be an issue. Right now, a node can position cleanup code before, after, or both before and after recursing to child nodes, whereas with this design change, the cleanup code will always be run before recursing to child nodes. Here, I think we have problems. Both ExecGather and ExecEndGatherMerge intentionally clean up the children before the parent, so that the child shutdown happens before ExecParallelCleanup(). Based on the comment and commit acf555bc53acb589b5a2827e65d655fa8c9adee0, this appears to be intentional, and you can sort of see why from looking at the stuff that happens in ExecParallelCleanup(). If the instrumentation data vanishes before the child nodes have a chance to clean things up, maybe EXPLAIN ANALYZE won't reflect that instrumentation any more. If the DSA vanishes, maybe we'll crash if we try to access it. If we actually reach DestroyParallelContext(), we're just going to start killing the workers. None of that sounds like what we want. The good news, of a sort, is that I think this might be the only case of this sort of problem. Most nodes recurse at the end, after doing all the cleanup, so the behavior won't change. Moreover, even if it did, most cleanup operations look pretty localized -- they affect only the node itself, and not its children. A somewhat interesting case is nodes associated with subplans. Right now, because of the coding of ExecEndPlan, nodes associated with subplans are all cleaned up at the very end, after everything that's not inside of a subplan. But with this change, they'd get cleaned up in the order of initialization, which actually seems more natural, as long as it doesn't break anything, which I think it probably won't, since as I mention in most cases node cleanup looks quite localized, i.e. it doesn't care whether it happens before or after the cleanup of other nodes. I think something will have to be done about the parallel query stuff, though. I'm not sure exactly what. It is a little weird that Gather and Gather Merge treat starting and killing workers as a purely "private matter" that they can decide to handle without the executor overall being very much aware of it. So maybe there's a way that some of the cleanup logic here could be hoisted up into the general executor machinery, that is, first end all the nodes, and then go back, and end all the parallelism using, maybe, another list inside of the estate. 
However, I think that the existence of ExecShutdownNode() is a complication here -- we need to make sure that we don't break either the case where that happens before overall plan shutdown, or the case where it doesn't. Third, a couple of minor comments on details of how you actually made these changes in the patch set. Personally, I would remove all of the "is closed separately" comments that you added. I think it's a violation of the general coding principle that you should make the code look like it's always been that way. Sure, in the immediate future, people might wonder why you don't need to recurse, but 5 or 10 years from now that's just going to be clutter. Second, in the cases where the ExecEndNode functions end up completely empty, I would suggest just removing the functions entirely and making the switch that dispatches on the node type have a switch case that lists all the nodes that don't need a callback here and say /* Nothing to do for these node types */ break;. This will save a few CPU cycles and I think it will be easier to read as well. Fourth, I wonder whether we really need this patch at all. I initially thought we did, because if we abandon the initialization of a plan partway through, then we end up with a plan that is in a state that previously would never have occurred, and we still have to be able to clean it up. However, perhaps it's a difference without a distinction. Say we have a partial plan tree, where not all of the PlanState nodes ever got created. We then just call the existing version of ExecEndPlan() on it, with no changes. What goes wrong? Sure, we might call ExecEndNode() on some null pointers where in the current world there would always be valid pointers, but ExecEndNode() will handle that just fine, by doing nothing for those nodes, because it starts with a NULL-check. Another alternative design might be to switch ExecEndNode to use planstate_tree_walker to walk the node tree, removing the walk from the node-type-specific functions as in this patch, and deleting the end-node functions that are no longer required altogether, as proposed above. I somehow feel that this would be cleaner than the status quo, but here again, I'm not sure we really need it. planstate_tree_walker would just pass over any NULL pointers that it found without doing anything, but the current code does that too, so while this might be more beautiful than what we have now, I'm not sure that there's any real reason to do it. The fact that, like the current patch, it would change the order in which nodes are cleaned up is also an issue -- the Gather/Gather Merge ordering issues might be easier to handle this way with some hack in ExecEndNode() than they are with the design you have now, but we'd still have to do something about them, I believe. Sorry if this is a bit of a meandering review, but those are my thoughts. -- Robert Haas EDB: http://www.enterprisedb.com
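For reference, the NULL-check being relied on here sits at the very top of ExecEndNode(); roughly, as an excerpt with the per-node dispatch elided:

void
ExecEndNode(PlanState *node)
{
    /*
     * do nothing when we get to the end of a leaf on tree.
     */
    if (node == NULL)
        return;

    /* ... dispatch to the node-type-specific ExecEnd* routine via nodeTag(node) ... */
}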
Robert Haas <robertmhaas@gmail.com> writes: > Second, I wondered whether the ordering of cleanup operations could be > an issue. Right now, a node can position cleanup code before, after, > or both before and after recursing to child nodes, whereas with this > design change, the cleanup code will always be run before recursing to > child nodes. Here, I think we have problems. Both ExecGather and > ExecEndGatherMerge intentionally clean up the children before the > parent, so that the child shutdown happens before > ExecParallelCleanup(). Based on the comment and commit > acf555bc53acb589b5a2827e65d655fa8c9adee0, this appears to be > intentional, and you can sort of see why from looking at the stuff > that happens in ExecParallelCleanup(). Right, I doubt that changing that is going to work out well. Hash joins might have issues with it too. Could it work to make the patch force child cleanup before parent, instead of after? Or would that break other places? On the whole though I think it's probably a good idea to leave parent nodes in control of the timing, so I kind of side with your later comment about whether we want to change this at all. regards, tom lane
On Mon, Aug 7, 2023 at 11:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Right, I doubt that changing that is going to work out well. > Hash joins might have issues with it too. I thought about the case, because Hash and Hash Join are such closely intertwined nodes, but I don't see any problem there. It doesn't really look like it would matter in what order things got cleaned up. Unless I'm missing something, all of the data structures are just independent things that we have to get rid of sometime. > Could it work to make the patch force child cleanup before parent, > instead of after? Or would that break other places? To me, it seems like the overwhelming majority of the code simply doesn't care. You could pick an order out of a hat and it would be 100% OK. But I haven't gone and looked through it with this specific idea in mind. > On the whole though I think it's probably a good idea to leave > parent nodes in control of the timing, so I kind of side with > your later comment about whether we want to change this at all. My overall feeling here is that what Gather and Gather Merge is doing is pretty weird. I think I kind of knew that at the time this was all getting implemented and reviewed, but I wasn't keen to introduce more infrastructure changes than necessary given that parallel query, as a project, was still pretty new and I didn't want to give other hackers more reasons to be unhappy with what was already a lot of very wide-ranging change to the system. A good number of years having gone by now, and other people having worked on that code some more, I'm not too worried about someone calling for a wholesale revert of parallel query. However, there's a second problem here as well, which is that I'm still not sure what the right thing to do is. We've fiddled around with the shutdown sequence for parallel query a number of times now, and I think there's still stuff that doesn't work quite right, especially around getting all of the instrumentation data back to the leader. I haven't spent enough time on this recently enough to be sure what if any problems remain, though. So on the one hand, I don't really like the fact that we have an ad-hoc recursion arrangement here, instead of using planstate_tree_walker or, as Amit proposes, a List. Giving subordinate nodes control over the ordering when they don't really need it just means we have more code with more possibility for bugs and less certainty about whether the theoretical flexibility is doing anything in practice. But on the other hand, because we know that at least for the Gather/GatherMerge case it seems like it probably matters somewhat, it definitely seems appealing not to change anything as part of this patch set that we don't really have to. I've had it firmly in my mind here that we were going to need to change something somehow -- I mean, the possibility of returning in the middle of node initialization seems like a pretty major change to the way this stuff works, and it seems hard for me to believe that we can just do that and not have to adjust any code anywhere else. Can it really be true that we can do that and yet not end up creating any states anywhere with which the current cleanup code is unprepared to cope? Maybe, but it would seem like rather good luck if that's how it shakes out. Still, at the moment, I'm having a hard time understanding what this particular change buys us. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Aug 8, 2023 at 12:36 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Aug 3, 2023 at 4:37 AM Amit Langote <amitlangote09@gmail.com> wrote: > > Here's a patch set where the refactoring to move the ExecutorStart() > > calls to be closer to GetCachedPlan() (for the call sites that use a > > CachedPlan) is extracted into a separate patch, 0002. Its commit > > message notes an aspect of this refactoring that I feel a bit nervous > > about -- needing to also move the CommandCounterIncrement() call from > > the loop in PortalRunMulti() to PortalStart() which now does > > ExecutorStart() for the PORTAL_MULTI_QUERY case. > > I spent some time today reviewing 0001. Here are a few thoughts and > notes about things that I looked at. Thanks for taking a look at this. > First, I wondered whether it was really adequate for ExecEndPlan() to > just loop over estate->es_plan_nodes and call it good. Put > differently, is it possible that we could ever have more than one > relevant EState, say for a subplan or an EPQ execution or something, > so that this loop wouldn't cover everything? I found nothing to make > me think that this is a real danger. Check. > Second, I wondered whether the ordering of cleanup operations could be > an issue. Right now, a node can position cleanup code before, after, > or both before and after recursing to child nodes, whereas with this > design change, the cleanup code will always be run before recursing to > child nodes. Because a node is appended to es_planstate_nodes at the end of ExecInitNode(), child nodes get added before their parent nodes. So the children are cleaned up first. > Here, I think we have problems. Both ExecGather and > ExecEndGatherMerge intentionally clean up the children before the > parent, so that the child shutdown happens before > ExecParallelCleanup(). Based on the comment and commit > acf555bc53acb589b5a2827e65d655fa8c9adee0, this appears to be > intentional, and you can sort of see why from looking at the stuff > that happens in ExecParallelCleanup(). If the instrumentation data > vanishes before the child nodes have a chance to clean things up, > maybe EXPLAIN ANALYZE won't reflect that instrumentation any more. If > the DSA vanishes, maybe we'll crash if we try to access it. If we > actually reach DestroyParallelContext(), we're just going to start > killing the workers. None of that sounds like what we want. > > The good news, of a sort, is that I think this might be the only case > of this sort of problem. Most nodes recurse at the end, after doing > all the cleanup, so the behavior won't change. Moreover, even if it > did, most cleanup operations look pretty localized -- they affect only > the node itself, and not its children. A somewhat interesting case is > nodes associated with subplans. Right now, because of the coding of > ExecEndPlan, nodes associated with subplans are all cleaned up at the > very end, after everything that's not inside of a subplan. But with > this change, they'd get cleaned up in the order of initialization, > which actually seems more natural, as long as it doesn't break > anything, which I think it probably won't, since as I mention in most > cases node cleanup looks quite localized, i.e. it doesn't care whether > it happens before or after the cleanup of other nodes. > > I think something will have to be done about the parallel query stuff, > though. I'm not sure exactly what. 
It is a little weird that Gather > and Gather Merge treat starting and killing workers as a purely > "private matter" that they can decide to handle without the executor > overall being very much aware of it. So maybe there's a way that some > of the cleanup logic here could be hoisted up into the general > executor machinery, that is, first end all the nodes, and then go > back, and end all the parallelism using, maybe, another list inside of > the estate. However, I think that the existence of ExecShutdownNode() > is a complication here -- we need to make sure that we don't break > either the case where that happen before overall plan shutdown, or the > case where it doesn't. Given that children are closed before parent, the order of operations in ExecEndGather[Merge] is unchanged. > Third, a couple of minor comments on details of how you actually made > these changes in the patch set. Personally, I would remove all of the > "is closed separately" comments that you added. I think it's a > violation of the general coding principle that you should make the > code look like it's always been that way. Sure, in the immediate > future, people might wonder why you don't need to recurse, but 5 or 10 > years from now that's just going to be clutter. Second, in the cases > where the ExecEndNode functions end up completely empty, I would > suggest just removing the functions entirely and making the switch > that dispatches on the node type have a switch case that lists all the > nodes that don't need a callback here and say /* Nothing do for these > node types */ break;. This will save a few CPU cycles and I think it > will be easier to read as well. I agree with both suggestions. > Fourth, I wonder whether we really need this patch at all. I initially > thought we did, because if we abandon the initialization of a plan > partway through, then we end up with a plan that is in a state that > previously would never have occurred, and we still have to be able to > clean it up. However, perhaps it's a difference without a distinction. > Say we have a partial plan tree, where not all of the PlanState nodes > ever got created. We then just call the existing version of > ExecEndPlan() on it, with no changes. What goes wrong? Sure, we might > call ExecEndNode() on some null pointers where in the current world > there would always be valid pointers, but ExecEndNode() will handle > that just fine, by doing nothing for those nodes, because it starts > with a NULL-check. Well, not all cleanup actions for a given node type are a recursive call to ExecEndNode(), some are also things like this: /* * clean out the tuple table */ ExecClearTuple(node->ps.ps_ResultTupleSlot); But should ExecInitNode() subroutines return the partially initialized PlanState node or NULL on detecting invalidation? If I'm understanding how you think this should be working correctly, I think you mean the former, because if it were the latter, ExecInitNode() would end up returning NULL at the top for the root and then there's nothing to pass to ExecEndNode(), so no way to clean up to begin with. In that case, I think we will need to adjust ExecEndNode() subroutines to add `if (node->ps.ps_ResultTupleSlot)` in the above code, for example. That's something Tom had said he doesn't like very much [1]. Some node types such as Append, BitmapAnd, etc. that contain a list of subplans would need some adjustment, such as using palloc0 for as_appendplans[], etc. so that uninitialized subplans have NULL in the array. 
There are also issues around ForeignScan, CustomScan ExecEndNode()-time callbacks when they are partially initialized -- is it OK to call the *EndScan callback if the *BeginScan one may not have been called to begin with? Though, perhaps we can adjust the ExecInitNode() subroutines for those to return NULL by opening the relation and checking for invalidation at the beginning instead of in the middle. That should be done for all Scan or leaf-level node types. Anyway, I guess, for the patch's purpose, maybe we should bite the bullet and make those adjustments rather than change ExecEndNode() as proposed. I can give that another try. > Another alternative design might be to switch ExecEndNode to use > planstate_tree_walker to walk the node tree, removing the walk from > the node-type-specific functions as in this patch, and deleting the > end-node functions that are no longer required altogether, as proposed > above. I somehow feel that this would be cleaner than the status quo, > but here again, I'm not sure we really need it. planstate_tree_walker > would just pass over any NULL pointers that it found without doing > anything, but the current code does that too, so while this might be > more beautiful than what we have now, I'm not sure that there's any > real reason to do it. The fact that, like the current patch, it would > change the order in which nodes are cleaned up is also an issue -- the > Gather/Gather Merge ordering issues might be easier to handle this way > with some hack in ExecEndNode() than they are with the design you have > now, but we'd still have to do something about them, I believe. It might be interesting to see if introducing planstate_tree_walker() in ExecEndNode() makes it easier to reason about ExecEndNode() generally speaking, but I think you may be right that doing so may not really make matters easier for the partially initialized planstate tree case. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
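To make that concrete, an ExecEnd* routine adjusted along those lines might look like the following sketch; FooState and its tableContext field are placeholders, not a real node type:

void
ExecEndFoo(FooState *node)
{
    /* these may never have been set up if ExecInitFoo() bailed out early */
    if (node->ps.ps_ResultTupleSlot != NULL)
        ExecClearTuple(node->ps.ps_ResultTupleSlot);
    if (node->tableContext != NULL)
        MemoryContextDelete(node->tableContext);

    /* ExecEndNode() already tolerates NULL, so the child needs no guard */
    ExecEndNode(outerPlanState(node));
}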
On Tue, Aug 8, 2023 at 10:32 AM Amit Langote <amitlangote09@gmail.com> wrote: > But should ExecInitNode() subroutines return the partially initialized > PlanState node or NULL on detecting invalidation? If I'm > understanding how you think this should be working correctly, I think > you mean the former, because if it were the latter, ExecInitNode() > would end up returning NULL at the top for the root and then there's > nothing to pass to ExecEndNode(), so no way to clean up to begin with. > In that case, I think we will need to adjust ExecEndNode() subroutines > to add `if (node->ps.ps_ResultTupleSlot)` in the above code, for > example. That's something Tom had said he doesn't like very much [1]. Yeah, I understood Tom's goal as being "don't return partially initialized nodes." Personally, I'm not sure that's an important goal. In fact, I don't even think it's a desirable one. It doesn't look difficult to audit the end-node functions for cases where they'd fail if a particular pointer were NULL instead of pointing to some real data, and just fixing all such cases to have NULL-tests looks like purely mechanical work that we are unlikely to get wrong. And at least some cases wouldn't require any changes at all. If we don't do that, the complexity doesn't go away. It just moves someplace else. Presumably what we do in that case is have ExecInitNode functions undo any initialization that they've already done before returning NULL. There are basically two ways to do that. Option one is to add code at the point where they return early to clean up anything they've already initialized, but that code is likely to substantially duplicate whatever the ExecEndNode function already knows how to do, and it's very easy for logic like this to get broken if somebody rearranges an ExecInitNode function down the road. Option two is to rearrange the ExecInitNode functions now, to open relations or recurse at the beginning, so that we discover the need to fail before we initialize anything. That restricts our ability to further rearrange the functions in future somewhat, but more importantly, IMHO, it introduces more risk right now. Checking that the ExecEndNode function will not fail if some pointers are randomly null is a lot easier than checking that changing the order of operations in an ExecInitNode function breaks nothing. I'm not here to say that we can't do one of those things. But I think adding null-tests to ExecEndNode functions looks like *far* less work and *way* less risk. There's a second issue here, too, which is when we abort ExecInitNode partway through, how do we signal that? You're rightly pointing out here that if we do that by returning NULL, then we don't do it by returning a pointer to the partially initialized node that we just created, which means that we either need to store those partially initialized nodes in a separate data structure as you propose to do in 0001, or else we need to pick a different signalling convention. We could change (a) ExecInitNode to have an additional argument, bool *kaboom, or (b) we could make it return bool and return the node pointer via a new additional argument, or (c) we could put a Boolean flag into the estate and let the function signal failure by flipping the value of the flag. If we do any of those things, then as far as I can see 0001 is unnecessary. If we do none of them but also avoid creating partially initialized nodes by one of the two techniques mentioned two paragraphs prior, then 0001 is also unnecessary. 
If we do none of them but do create partially initialized nodes, then we need 0001. So if this were a restaurant menu, then it might look like this:

Prix Fixe Menu (choose one from each)

First Course - How do we clean up after partial initialization?
(1) ExecInitNode functions produce partially initialized nodes
(2) ExecInitNode functions get refactored so that the stuff that can cause early exit always happens first, so that no cleanup is ever needed
(3) ExecInitNode functions do any required cleanup in situ

Second Course - How do we signal that initialization stopped early?
(A) Return NULL.
(B) Add a bool * out-parameter to ExecInitNode.
(C) Add a Node * out-parameter to ExecInitNode and change the return value to bool.
(D) Add a bool to the EState.
(E) Something else, maybe.

I think that we need 0001 if we choose specifically (1) and (A). My gut feeling is that the least-invasive way to do this project is to choose (1) and (D). My second choice would be (1) and (C), and my third choice would be (1) and (A). If I can't have (1), I think I prefer (2) over (3), but I also believe I prefer hiding in a deep hole to either of them. Maybe I'm not seeing the whole picture correctly here, but both (2) and (3) look awfully painful to me. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Aug 9, 2023 at 1:05 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Aug 8, 2023 at 10:32 AM Amit Langote <amitlangote09@gmail.com> wrote: > > But should ExecInitNode() subroutines return the partially initialized > > PlanState node or NULL on detecting invalidation? If I'm > > understanding how you think this should be working correctly, I think > > you mean the former, because if it were the latter, ExecInitNode() > > would end up returning NULL at the top for the root and then there's > > nothing to pass to ExecEndNode(), so no way to clean up to begin with. > > In that case, I think we will need to adjust ExecEndNode() subroutines > > to add `if (node->ps.ps_ResultTupleSlot)` in the above code, for > > example. That's something Tom had said he doesn't like very much [1]. > > Yeah, I understood Tom's goal as being "don't return partially > initialized nodes." > > Personally, I'm not sure that's an important goal. In fact, I don't > even think it's a desirable one. It doesn't look difficult to audit > the end-node functions for cases where they'd fail if a particular > pointer were NULL instead of pointing to some real data, and just > fixing all such cases to have NULL-tests looks like purely mechanical > work that we are unlikely to get wrong. And at least some cases > wouldn't require any changes at all. > > If we don't do that, the complexity doesn't go away. It just moves > someplace else. Presumably what we do in that case is have > ExecInitNode functions undo any initialization that they've already > done before returning NULL. There are basically two ways to do that. > Option one is to add code at the point where they return early to > clean up anything they've already initialized, but that code is likely > to substantially duplicate whatever the ExecEndNode function already > knows how to do, and it's very easy for logic like this to get broken > if somebody rearranges an ExecInitNode function down the road. Yeah, I too am not a fan of making ExecInitNode() clean up partially initialized nodes. > Option > two is to rearrange the ExecInitNode functions now, to open relations > or recurse at the beginning, so that we discover the need to fail > before we initialize anything. That restricts our ability to further > rearrange the functions in future somewhat, but more importantly, > IMHO, it introduces more risk right now. Checking that the ExecEndNode > function will not fail if some pointers are randomly null is a lot > easier than checking that changing the order of operations in an > ExecInitNode function breaks nothing. > > I'm not here to say that we can't do one of those things. But I think > adding null-tests to ExecEndNode functions looks like *far* less work > and *way* less risk. +1 > There's a second issue here, too, which is when we abort ExecInitNode > partway through, how do we signal that? You're rightly pointing out > here that if we do that by returning NULL, then we don't do it by > returning a pointer to the partially initialized node that we just > created, which means that we either need to store those partially > initialized nodes in a separate data structure as you propose to do in > 0001, > > or else we need to pick a different signalling convention. 
We > could change (a) ExecInitNode to have an additional argument, bool > *kaboom, or (b) we could make it return bool and return the node > pointer via a new additional argument, or (c) we could put a Boolean > flag into the estate and let the function signal failure by flipping > the value of the flag. The failure can already be detected by seeing that ExecPlanIsValid(estate) is false. The question is what ExecInitNode() or any of its subroutines should return once it is. I think the following convention works: Return partially initialized state from ExecInit* function where we detect the invalidation after calling ExecInitNode() on a child plan, so that ExecEndNode() can recurse to clean it up. Return NULL from ExecInit* functions where we detect the invalidation after opening and locking a relation but before calling ExecInitNode() to initialize a child plan if there's one at all. Even if we may set things like ExprContext, TupleTableSlot fields, they are cleaned up independently of the plan tree anyway via the cleanup called with es_exprcontexts, es_tupleTable, respectively. I even noticed bits like this in ExecEnd* functions: - /* - * Free the exprcontext(s) ... now dead code, see ExecFreeExprContext - */ -#ifdef NOT_USED - ExecFreeExprContext(&node->ss.ps); - if (node->ioss_RuntimeContext) - FreeExprContext(node->ioss_RuntimeContext, true); -#endif So, AFAICS, ExprContext, TupleTableSlot cleanup in ExecNode* functions is unnecessary but remain around because nobody cared about and got around to getting rid of it. > If we do any of those things, then as far as I > can see 0001 is unnecessary. If we do none of them but also avoid > creating partially initialized nodes by one of the two techniques > mentioned two paragraphs prior, then 0001 is also unnecessary. If we > do none of them but do create partially initialized nodes, then we > need 0001. > > So if this were a restaurant menu, then it might look like this: > > Prix Fixe Menu (choose one from each) > > First Course - How do we clean up after partial initialization? > (1) ExecInitNode functions produce partially initialized nodes > (2) ExecInitNode functions get refactored so that the stuff that can > cause early exit always happens first, so that no cleanup is ever > needed > (3) ExecInitNode functions do any required cleanup in situ > > Second Course - How do we signal that initialization stopped early? > (A) Return NULL. > (B) Add a bool * out-parmeter to ExecInitNode. > (C) Add a Node * out-parameter to ExecInitNode and change the return > value to bool. > (D) Add a bool to the EState. > (E) Something else, maybe. > > I think that we need 0001 if we choose specifically (1) and (A). My > gut feeling is that the least-invasive way to do this project is to > choose (1) and (D). My second choice would be (1) and (C), and my > third choice would be (1) and (A). If I can't have (1), I think I > prefer (2) over (3), but I also believe I prefer hiding in a deep hole > to either of them. Maybe I'm not seeing the whole picture correctly > here, but both (2) and (3) look awfully painful to me. I think what I've ended up with in the attached 0001 (WIP) is both (1), (2), and (D). As mentioned above, (D) is implemented with the ExecPlanStillValid() function. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
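A minimal sketch of that convention for a node type with one child might look as follows; everything except ExecPlanStillValid() is a placeholder, and the real patch will differ in detail:

FooState *
ExecInitFoo(Foo *node, EState *estate, int eflags)
{
    FooState   *foostate = makeNode(FooState);

    foostate->ps.plan = (Plan *) node;
    foostate->ps.state = estate;

    /* initializing the child may lock partitions and detect invalidation */
    outerPlanState(foostate) = ExecInitNode(outerPlan(node), estate, eflags);
    if (!ExecPlanStillValid(estate))
        return foostate;        /* partially initialized; cleanup copes */

    /* ... node-local initialization that is safe to skip on early exit ... */

    return foostate;
}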
Attachment
- v45-0006-Track-opened-range-table-relations-in-a-List-in-.patch
- v45-0003-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v45-0005-Delay-locking-of-child-tables-in-cached-plans-un.patch
- v45-0004-Set-inFromCl-to-false-in-child-table-RTEs.patch
- v45-0002-Refactoring-to-move-ExecutorStart-calls-to-be-ne.patch
- v45-0001-Add-support-for-allowing-ExecInitNode-to-detect-.patch
On Fri, Aug 11, 2023 at 14:31 Amit Langote <amitlangote09@gmail.com> wrote:
On Wed, Aug 9, 2023 at 1:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 8, 2023 at 10:32 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > But should ExecInitNode() subroutines return the partially initialized
> > PlanState node or NULL on detecting invalidation? If I'm
> > understanding how you think this should be working correctly, I think
> > you mean the former, because if it were the latter, ExecInitNode()
> > would end up returning NULL at the top for the root and then there's
> > nothing to pass to ExecEndNode(), so no way to clean up to begin with.
> > In that case, I think we will need to adjust ExecEndNode() subroutines
> > to add `if (node->ps.ps_ResultTupleSlot)` in the above code, for
> > example. That's something Tom had said he doesn't like very much [1].
>
> Yeah, I understood Tom's goal as being "don't return partially
> initialized nodes."
>
> Personally, I'm not sure that's an important goal. In fact, I don't
> even think it's a desirable one. It doesn't look difficult to audit
> the end-node functions for cases where they'd fail if a particular
> pointer were NULL instead of pointing to some real data, and just
> fixing all such cases to have NULL-tests looks like purely mechanical
> work that we are unlikely to get wrong. And at least some cases
> wouldn't require any changes at all.
>
> If we don't do that, the complexity doesn't go away. It just moves
> someplace else. Presumably what we do in that case is have
> ExecInitNode functions undo any initialization that they've already
> done before returning NULL. There are basically two ways to do that.
> Option one is to add code at the point where they return early to
> clean up anything they've already initialized, but that code is likely
> to substantially duplicate whatever the ExecEndNode function already
> knows how to do, and it's very easy for logic like this to get broken
> if somebody rearranges an ExecInitNode function down the road.
Yeah, I too am not a fan of making ExecInitNode() clean up partially
initialized nodes.
> Option
> two is to rearrange the ExecInitNode functions now, to open relations
> or recurse at the beginning, so that we discover the need to fail
> before we initialize anything. That restricts our ability to further
> rearrange the functions in future somewhat, but more importantly,
> IMHO, it introduces more risk right now. Checking that the ExecEndNode
> function will not fail if some pointers are randomly null is a lot
> easier than checking that changing the order of operations in an
> ExecInitNode function breaks nothing.
>
> I'm not here to say that we can't do one of those things. But I think
> adding null-tests to ExecEndNode functions looks like *far* less work
> and *way* less risk.
+1
> There's a second issue here, too, which is when we abort ExecInitNode
> partway through, how do we signal that? You're rightly pointing out
> here that if we do that by returning NULL, then we don't do it by
> returning a pointer to the partially initialized node that we just
> created, which means that we either need to store those partially
> initialized nodes in a separate data structure as you propose to do in
> 0001,
>
> or else we need to pick a different signalling convention. We
> could change (a) ExecInitNode to have an additional argument, bool
> *kaboom, or (b) we could make it return bool and return the node
> pointer via a new additional argument, or (c) we could put a Boolean
> flag into the estate and let the function signal failure by flipping
> the value of the flag.
The failure can already be detected by seeing that
ExecPlanIsValid(estate) is false. The question is what ExecInitNode()
or any of its subroutines should return once it is. I think the
following convention works:
Return partially initialized state from ExecInit* function where we
detect the invalidation after calling ExecInitNode() on a child plan,
so that ExecEndNode() can recurse to clean it up.
Return NULL from ExecInit* functions where we detect the invalidation
after opening and locking a relation but before calling ExecInitNode()
to initialize a child plan if there's one at all. Even if we may set
things like ExprContext, TupleTableSlot fields, they are cleaned up
independently of the plan tree anyway via the cleanup called with
es_exprcontexts, es_tupleTable, respectively. I even noticed bits
like this in ExecEnd* functions:
- /*
- * Free the exprcontext(s) ... now dead code, see ExecFreeExprContext
- */
-#ifdef NOT_USED
- ExecFreeExprContext(&node->ss.ps);
- if (node->ioss_RuntimeContext)
- FreeExprContext(node->ioss_RuntimeContext, true);
-#endif
So, AFAICS, ExprContext, TupleTableSlot cleanup in ExecNode* functions
is unnecessary but remain around because nobody cared about and got
around to getting rid of it.
> If we do any of those things, then as far as I
> can see 0001 is unnecessary. If we do none of them but also avoid
> creating partially initialized nodes by one of the two techniques
> mentioned two paragraphs prior, then 0001 is also unnecessary. If we
> do none of them but do create partially initialized nodes, then we
> need 0001.
>
> So if this were a restaurant menu, then it might look like this:
>
> Prix Fixe Menu (choose one from each)
>
> First Course - How do we clean up after partial initialization?
> (1) ExecInitNode functions produce partially initialized nodes
> (2) ExecInitNode functions get refactored so that the stuff that can
> cause early exit always happens first, so that no cleanup is ever
> needed
> (3) ExecInitNode functions do any required cleanup in situ
>
> Second Course - How do we signal that initialization stopped early?
> (A) Return NULL.
> (B) Add a bool * out-parameter to ExecInitNode.
> (C) Add a Node * out-parameter to ExecInitNode and change the return
> value to bool.
> (D) Add a bool to the EState.
> (E) Something else, maybe.
>
> I think that we need 0001 if we choose specifically (1) and (A). My
> gut feeling is that the least-invasive way to do this project is to
> choose (1) and (D). My second choice would be (1) and (C), and my
> third choice would be (1) and (A). If I can't have (1), I think I
> prefer (2) over (3), but I also believe I prefer hiding in a deep hole
> to either of them. Maybe I'm not seeing the whole picture correctly
> here, but both (2) and (3) look awfully painful to me.
I think what I've ended up with in the attached 0001 (WIP) is both
(1), (2), and (D). As mentioned above, (D) is implemented with the
ExecPlanStillValid() function.
After removing the unnecessary cleanup code from most node types’ ExecEnd* functions, one thing I’m tempted to do is remove the functions that do nothing else but recurse to close the outerPlan, innerPlan child nodes. We could instead have ExecEndNode() itself recurse to close outerPlan, innerPlan child nodes at the top, which preserves the close-child-before-self behavior for Gather* nodes, and call node type specific cleanup functions only for nodes that do have any local cleanup to do. Perhaps we could even use planstate_tree_walker() called at the top instead of the usual bottom so that nodes with a list of child subplans like Append also don’t need to have their own ExecEnd* functions.
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
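A sketch of that planstate_tree_walker()-at-the-top arrangement, purely for illustration (the case list is incomplete and the actual patch may differ):

static bool ExecEndNodeWalker(PlanState *node, void *context);

void
ExecEndNode(PlanState *node)
{
    if (node == NULL)
        return;

    /* close all children first (covers Append's subplan list as well) */
    planstate_tree_walker(node, ExecEndNodeWalker, NULL);

    switch (nodeTag(node))
    {
        case T_GatherState:
            ExecEndGather((GatherState *) node);    /* no longer recurses itself */
            break;

            /* ... other node types that still need local cleanup ... */

        default:
            /* Nothing to do for node types without local cleanup */
            break;
    }
}

static bool
ExecEndNodeWalker(PlanState *node, void *context)
{
    ExecEndNode(node);
    return false;
}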
On Fri, Aug 11, 2023 at 9:50 AM Amit Langote <amitlangote09@gmail.com> wrote: > After removing the unnecessary cleanup code from most node types’ ExecEnd* functions, one thing I’m tempted to do is remove the functions that do nothing else but recurse to close the outerPlan, innerPlan child nodes. We could instead have ExecEndNode() itself recurse to close outerPlan, innerPlan child nodes at the top, which preserves the close-child-before-self behavior for Gather* nodes, and call node type specific cleanup functions only for nodes that do have any local cleanup to do. Perhaps we could even use planstate_tree_walker() called at the top instead of the usual bottom so that nodes with a list of child subplans like Append also don’t need to have their own ExecEnd* functions.

I think 0001 needs to be split up. Like, this is code cleanup:

- /*
-  * Free the exprcontext
-  */
- ExecFreeExprContext(&node->ss.ps);

This is providing for NULL pointers where we don't currently:

- list_free_deep(aggstate->hash_batches);
+ if (aggstate->hash_batches)
+     list_free_deep(aggstate->hash_batches);

And this is the early return mechanism per se:

+ if (!ExecPlanStillValid(estate))
+     return aggstate;

I think at least those 3 kinds of changes deserve to be in separate patches with separate commit messages explaining the rationale behind each, e.g. "Remove unnecessary cleanup calls in ExecEnd* functions. These calls are no longer required, because <reasons>. Removing them saves a few CPU cycles and simplifies planned refactoring, so do that." -- Robert Haas EDB: http://www.enterprisedb.com
Thanks for taking a look. On Mon, Aug 28, 2023 at 10:43 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Aug 11, 2023 at 9:50 AM Amit Langote <amitlangote09@gmail.com> wrote: > > After removing the unnecessary cleanup code from most node types’ ExecEnd* functions, one thing I’m tempted to do is remove the functions that do nothing else but recurse to close the outerPlan, innerPlan child nodes. We could instead have ExecEndNode() itself recurse to close outerPlan, innerPlan child nodes at the top, which preserves the close-child-before-self behavior for Gather* nodes, and call node type specific cleanup functions only for nodes that do have any local cleanup to do. Perhaps we could even use planstate_tree_walker() called at the top instead of the usual bottom so that nodes with a list of child subplans like Append also don’t need to have their own ExecEnd* functions. > > I think 0001 needs to be split up. Like, this is code cleanup: > > - /* > - * Free the exprcontext > - */ > - ExecFreeExprContext(&node->ss.ps); > > This is providing for NULL pointers where we don't currently: > > - list_free_deep(aggstate->hash_batches); > + if (aggstate->hash_batches) > + list_free_deep(aggstate->hash_batches); > > And this is the early return mechanism per se: > > + if (!ExecPlanStillValid(estate)) > + return aggstate; > > I think at least those 3 kinds of changes deserve to be in separate > patches with separate commit messages explaining the rationale behind > each e.g. "Remove unnecessary cleanup calls in ExecEnd* functions. > These calls are no longer required, because <reasons>. Removing them > saves a few CPU cycles and simplifies planned refactoring, so do > that." Breaking up the patch as you describe makes sense, so I've done that: Attached 0001 removes unnecessary cleanup calls from ExecEnd*() routines. 0002 adds NULLness checks in ExecEnd*() routines on some pointers that may not be initialized by the corresponding ExecInit*() routines in the case where they return early. 0003 adds the early return mechanism based on checking CachedPlan invalidation, though no CachedPlan is actually passed to the executor yet, so no functional changes here yet. Other patches are rebased over these. One significant change is in 0004 which does the refactoring to make the callers of ExecutorStart() aware that it may now return with a partially initialized planstate tree that should not be executed. I added a new flag EState.es_canceled to denote that state of the execution to complement the existing es_finished. I also needed to add AfterTriggerCancelQuery() to ensure that we don't attempt to fire a canceled query's triggers. Most of these changes are needed only to appease the various Asserts in these parts of the code and I thought they are warranted given the introduction of a new state of query execution. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
- v46-0004-Make-ExecutorStart-return-early-upon-plan-invali.patch
- v46-0006-Set-inFromCl-to-false-in-child-table-RTEs.patch
- v46-0005-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v46-0008-Track-opened-range-table-relations-in-a-List-in-.patch
- v46-0007-Delay-locking-of-child-tables-in-cached-plans-un.patch
- v46-0003-Support-for-ExecInitNode-to-detect-CachedPlan-in.patch
- v46-0001-Refactor-ExecEnd-routines-to-enhance-efficiency.patch
- v46-0002-Check-pointer-NULLness-before-cleanup-in-ExecEnd.patch
On Tue, Sep 5, 2023 at 3:13 AM Amit Langote <amitlangote09@gmail.com> wrote: > Attached 0001 removes unnecessary cleanup calls from ExecEnd*() routines. It also adds a few random Assert()s to verify that unrelated pointers are not NULL. I suggest that it shouldn't do that. The commit message doesn't mention the removal of the calls to ExecDropSingleTupleTableSlot. It's not clear to me why that's OK and I think it would be nice to mention it in the commit message, assuming that it is in fact OK. I suggest changing the subject line of the commit to something like "Remove obsolete executor cleanup code." > 0002 adds NULLness checks in ExecEnd*() routines on some pointers that > may not be initialized by the corresponding ExecInit*() routines in > the case where it returns early. I think you should only add these where it's needed. For example, I think list_free_deep(NIL) is fine. The changes to ExecEndForeignScan look like they include stuff that belongs in 0001. Personally, I prefer explicit NULL-tests i.e. if (x != NULL) to implicit ones like if (x), but opinions vary. -- Robert Haas EDB: http://www.enterprisedb.com
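As a concrete instance of "only where it's needed": list_free_deep() already treats an empty list as a no-op, whereas MemoryContextDelete() must not be passed NULL, so only the latter needs a guard (using the fields already discussed in this thread):

/* NIL-safe: list_free_deep() simply returns if the list is empty */
list_free_deep(aggstate->hash_batches);

/* not NULL-safe: guard the context before deleting it */
if (node->tableContext != NULL)
    MemoryContextDelete(node->tableContext);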
On Tue, Sep 5, 2023 at 11:41 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Sep 5, 2023 at 3:13 AM Amit Langote <amitlangote09@gmail.com> wrote: > > Attached 0001 removes unnecessary cleanup calls from ExecEnd*() routines. > > It also adds a few random Assert()s to verify that unrelated pointers > are not NULL. I suggest that it shouldn't do that. OK, removed. > The commit message doesn't mention the removal of the calls to > ExecDropSingleTupleTableSlot. It's not clear to me why that's OK and I > think it would be nice to mention it in the commit message, assuming > that it is in fact OK. That is not OK, so I dropped their removal. I think I confused them with slots in other functions initialized with ExecInitExtraTupleSlot() that *are* put into the estate. > I suggest changing the subject line of the commit to something like > "Remove obsolete executor cleanup code." Sure. > > 0002 adds NULLness checks in ExecEnd*() routines on some pointers that > > may not be initialized by the corresponding ExecInit*() routines in > > the case where it returns early. > > I think you should only add these where it's needed. For example, I > think list_free_deep(NIL) is fine. OK, done. > The changes to ExecEndForeignScan look like they include stuff that > belongs in 0001. Oops, yes. Moved to 0001. > Personally, I prefer explicit NULL-tests i.e. if (x != NULL) to > implicit ones like if (x), but opinions vary. I agree, so changed all the new tests to use (x != NULL) form. Typically, I try to stick with whatever style is used in the nearby code, though I can see both styles being used in the ExecEnd*() routines. I opted to use the style that we both happen to prefer. Attached updated patches. Thanks for the review. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
- v47-0005-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v47-0006-Set-inFromCl-to-false-in-child-table-RTEs.patch
- v47-0008-Track-opened-range-table-relations-in-a-List-in-.patch
- v47-0007-Delay-locking-of-child-tables-in-cached-plans-un.patch
- v47-0004-Adjustments-to-allow-ExecutorStart-to-sometimes-.patch
- v47-0003-Support-for-ExecInitNode-to-detect-CachedPlan-in.patch
- v47-0002-Check-pointer-NULLness-before-cleanup-in-ExecEnd.patch
- v47-0001-Remove-obsolete-executor-cleanup-code.patch
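On the ExecDropSingleTupleTableSlot() point above, the distinction is between slots registered in the EState's tuple table and standalone slots. A minimal illustration follows, using the stock executor API; estate and tupdesc are assumed to come from the surrounding ExecInit* routine.

    TupleTableSlot *slot_in_estate;
    TupleTableSlot *standalone_slot;

    /*
     * Registered in estate->es_tupleTable, so ExecResetTupleTable() frees
     * it during ExecEndPlan(); an ExecEnd* routine need not drop it.
     */
    slot_in_estate = ExecInitExtraTupleSlot(estate, tupdesc, &TTSOpsVirtual);

    /*
     * Not tracked by the EState at all, so whoever created it must drop it
     * explicitly, typically in the node's ExecEnd* routine.
     */
    standalone_slot = MakeSingleTupleTableSlot(tupdesc, &TTSOpsVirtual);
    /* ... use the slot ... */
    ExecDropSingleTupleTableSlot(standalone_slot);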
On Wed, Sep 6, 2023 at 5:12 AM Amit Langote <amitlangote09@gmail.com> wrote: > Attached updated patches. Thanks for the review. I think 0001 looks ready to commit. I'm not sure that the commit message needs to mention future patches here, since this code cleanup seems like a good idea regardless, but if you feel otherwise, fair enough. On 0002, some questions: - In ExecEndLockRows, is the call to EvalPlanQualEnd a concern? i.e. Does that function need any adjustment? - In ExecEndMemoize, should there be a null-test around MemoryContextDelete(node->tableContext) as we have in ExecEndRecursiveUnion, ExecEndSetOp, etc.? I wonder how we feel about setting pointers to NULL after freeing the associated data structures. The existing code isn't consistent about doing that, and making it do so would be a fairly large change that would bloat this patch quite a bit. On the other hand, I think it's a good practice as a general matter, and we do do it in some ExecEnd functions. On 0003, I have some doubt about whether we really have all the right design decisions in detail here: - Why have this weird rule where sometimes we return NULL and other times the planstate? Is there any point to such a coding rule? Why not just always return the planstate? - Is there any point to all of these early exit cases? For example, in ExecInitBitmapAnd, why exit early if initialization fails? Why not just plunge ahead and if initialization failed the caller will notice that and when we ExecEndNode some of the child node pointers will be NULL but who cares? The obvious disadvantage of this approach is that we're doing a bunch of unnecessary initialization, but we're also speeding up the common case where we don't need to abort by avoiding a branch that will rarely be taken. I'm not quite sure what the right thing to do is here. - The cases where we call ExecGetRangeTableRelation or ExecOpenScanRelation are a bit subtler ... maybe initialization that we're going to do later is going to barf if the tuple descriptor of the relation isn't what we thought it was going to be. In that case it becomes important to exit early. But if that's not actually a problem, then we could apply the same principle here also -- don't pollute the code with early-exit cases, just let it do its thing and sort it out later. Do you know what the actual problems would be here if we didn't exit early in these cases? - Depending on the answers to the above points, one thing we could think of doing is put an early exit case into ExecInitNode itself: if (unlikely(!ExecPlanStillValid(whatever)) return NULL. Maybe Andres or someone is going to argue that that checks too often and is thus too expensive, but it would be a lot more maintainable than having similar checks strewn throughout the ExecInit* functions. Perhaps it deserves some thought/benchmarking. More generally, if there's anything we can do to centralize these checks in fewer places, I think that would be worth considering. The patch isn't terribly large as it stands, so I don't necessarily think that this is a critical issue, but I'm just wondering if we can do better. I'm not even sure that it would be too expensive to just initialize the whole plan always, and then just do one test at the end. That's not OK if the changed tuple descriptor (or something else) is going to crash or error out in a funny way or something before initialization is completed, but if it's just going to result in burning a few CPU cycles in a corner case, I don't know if we should really care. 
- The "At this point" comments don't give any rationale for why we shouldn't have received any such invalidation messages. That makes them fairly useless; the Assert by itself clarifies that you think that case shouldn't happen. The comment's job is to justify that claim. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Sep 6, 2023 at 11:20 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Sep 6, 2023 at 5:12 AM Amit Langote <amitlangote09@gmail.com> wrote: > > Attached updated patches. Thanks for the review. > > I think 0001 looks ready to commit. I'm not sure that the commit > message needs to mention future patches here, since this code cleanup > seems like a good idea regardless, but if you feel otherwise, fair > enough. OK, I will remove the mention of future patches. > On 0002, some questions: > > - In ExecEndLockRows, is the call to EvalPlanQualEnd a concern? i.e. > Does that function need any adjustment? I think it does with the patch as it stands. It needs to have an early exit at the top if parentestate is NULL, which it would be if EvalPlanQualInit() wasn't called from an ExecInit*() function. Though, as I answer below your question as to whether there is actually any need to interrupt all of the ExecInit*() routines, nothing needs to change in ExecEndLockRows(). > - In ExecEndMemoize, should there be a null-test around > MemoryContextDelete(node->tableContext) as we have in > ExecEndRecursiveUnion, ExecEndSetOp, etc.? Oops, you're right. Added. > I wonder how we feel about setting pointers to NULL after freeing the > associated data structures. The existing code isn't consistent about > doing that, and making it do so would be a fairly large change that > would bloat this patch quite a bit. On the other hand, I think it's a > good practice as a general matter, and we do do it in some ExecEnd > functions. I agree that it might be worthwhile to take the opportunity and make the code more consistent in this regard. So, I've included those changes too in 0002. > On 0003, I have some doubt about whether we really have all the right > design decisions in detail here: > > - Why have this weird rule where sometimes we return NULL and other > times the planstate? Is there any point to such a coding rule? Why not > just always return the planstate? > > - Is there any point to all of these early exit cases? For example, in > ExecInitBitmapAnd, why exit early if initialization fails? Why not > just plunge ahead and if initialization failed the caller will notice > that and when we ExecEndNode some of the child node pointers will be > NULL but who cares? The obvious disadvantage of this approach is that > we're doing a bunch of unnecessary initialization, but we're also > speeding up the common case where we don't need to abort by avoiding a > branch that will rarely be taken. I'm not quite sure what the right > thing to do is here. > > - The cases where we call ExecGetRangeTableRelation or > ExecOpenScanRelation are a bit subtler ... maybe initialization that > we're going to do later is going to barf if the tuple descriptor of > the relation isn't what we thought it was going to be. In that case it > becomes important to exit early. But if that's not actually a problem, > then we could apply the same principle here also -- don't pollute the > code with early-exit cases, just let it do its thing and sort it out > later. Do you know what the actual problems would be here if we didn't > exit early in these cases? > > - Depending on the answers to the above points, one thing we could > think of doing is put an early exit case into ExecInitNode itself: if > (unlikely(!ExecPlanStillValid(whatever)) return NULL. 
Maybe Andres or > someone is going to argue that that checks too often and is thus too > expensive, but it would be a lot more maintainable than having similar > checks strewn throughout the ExecInit* functions. Perhaps it deserves > some thought/benchmarking. More generally, if there's anything we can > do to centralize these checks in fewer places, I think that would be > worth considering. The patch isn't terribly large as it stands, so I > don't necessarily think that this is a critical issue, but I'm just > wondering if we can do better. I'm not even sure that it would be too > expensive to just initialize the whole plan always, and then just do > one test at the end. That's not OK if the changed tuple descriptor (or > something else) is going to crash or error out in a funny way or > something before initialization is completed, but if it's just going > to result in burning a few CPU cycles in a corner case, I don't know > if we should really care. I thought about this some and figured that adding the is-CachedPlan-still-valid tests in the following places should suffice after all: 1. In InitPlan() right after the top-level ExecInitNode() calls 2. In ExecInit*() functions of Scan nodes, right after ExecOpenScanRelation() calls CachedPlans can only become invalid because of concurrent changes to the inheritance child tables referenced in the plan. Only the following schema modifications of child tables are possible to be performed concurrently: * Addition of a column (allowed only if traditional inheritance child) * Addition of an index * Addition of a non-index constraint * Dropping of a child table (allowed only if traditional inheritance child) * Dropping of an index referenced in the plan The first 3 are not destructive enough to cause crashes, weird errors during ExecInit*(), though the last two can be, so the 2nd set of the tests after ExecOpenScanRelation() mentioned above. > - The "At this point" comments don't give any rationale for why we > shouldn't have received any such invalidation messages. That makes > them fairly useless; the Assert by itself clarifies that you think > that case shouldn't happen. The comment's job is to justify that > claim. I've rewritten the comments. I'll post the updated set of patches shortly. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
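To illustrate the second set of checks described above, here is roughly where they would sit in a scan node's ExecInit routine, using ExecInitSeqScan()-style code (abridged sketch; ExecPlanStillValid() is the helper from the patch set):

    /* open the scan relation, which may take a lock not yet held */
    scanstate->ss.ss_currentRelation =
        ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);

    /*
     * Taking that lock may have delivered an invalidation message that
     * marks the CachedPlan stale, for example because an index referenced
     * by the plan was concurrently dropped.  Bail out before building
     * tuple slots and expression state that could depend on a now-outdated
     * tuple descriptor.
     */
    if (!ExecPlanStillValid(estate))
        return scanstate;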
On Mon, Sep 25, 2023 at 9:57 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Wed, Sep 6, 2023 at 11:20 PM Robert Haas <robertmhaas@gmail.com> wrote: > > - Is there any point to all of these early exit cases? For example, in > > ExecInitBitmapAnd, why exit early if initialization fails? Why not > > just plunge ahead and if initialization failed the caller will notice > > that and when we ExecEndNode some of the child node pointers will be > > NULL but who cares? The obvious disadvantage of this approach is that > > we're doing a bunch of unnecessary initialization, but we're also > > speeding up the common case where we don't need to abort by avoiding a > > branch that will rarely be taken. I'm not quite sure what the right > > thing to do is here. > I thought about this some and figured that adding the > is-CachedPlan-still-valid tests in the following places should suffice > after all: > > 1. In InitPlan() right after the top-level ExecInitNode() calls > 2. In ExecInit*() functions of Scan nodes, right after > ExecOpenScanRelation() calls After sleeping on this, I think we do need the checks after all the ExecInitNode() calls too, because we have many instances of the code like the following one: outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags); tupDesc = ExecGetResultType(outerPlanState(gatherstate)); <some code that dereferences outDesc> If outerNode is a SeqScan and ExecInitSeqScan() returned early because ExecOpenScanRelation() detected that plan was invalidated, then tupDesc would be NULL in this case, causing the code to crash. Now one might say that perhaps we should only add the is-CachedPlan-valid test in the instances where there is an actual risk of such misbehavior, but that could lead to confusion, now or later. It seems better to add them after every ExecInitNode() call while we're inventing the notion, because doing so relieves the authors of future enhancements of the ExecInit*() routines from worrying about any of this. Attached 0003 should show how that turned out. Updated 0002 as mentioned in the previous reply -- setting pointers to NULL after freeing them more consistently across various ExecEnd*() routines and using the `if (pointer != NULL)` style over the `if (pointer)` more consistently. Updated 0001's commit message to remove the mention of its relation to any future commits. I intend to push it tomorrow. Patches 0004 onwards contain changes too, mainly in terms of moving the code around from one patch to another, but I'll omit the details of the specific change for now. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
- v47-0005-Teach-the-executor-to-lock-child-tables-in-some-.patch
- v47-0007-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v47-0009-Track-opened-range-table-relations-in-a-List-in-.patch
- v47-0006-Assert-that-relations-needing-their-permissions-.patch
- v47-0008-Delay-locking-of-child-tables-in-cached-plans-un.patch
- v47-0002-Check-pointer-NULLness-before-cleanup-in-ExecEnd.patch
- v47-0004-Adjustments-to-allow-ExecutorStart-to-sometimes-.patch
- v47-0003-Prepare-executor-to-support-detecting-CachedPlan.patch
- v47-0001-Remove-obsolete-executor-cleanup-code.patch
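The hazard described in the Gather example above, plus the guard that 0003 adds after the ExecInitNode() call, looks like this in abridged form (again, ExecPlanStillValid() comes from the patch set):

    /* initialize the child, which may find that the CachedPlan is stale */
    outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);

    /*
     * If the child returned early, its result descriptor was never set up,
     * so ExecGetResultType() would hand back NULL and the code below would
     * crash dereferencing it.  Return the partially initialized node
     * instead; ExecEndNode() knows how to clean it up.
     */
    if (!ExecPlanStillValid(estate))
        return gatherstate;

    tupDesc = ExecGetResultType(outerPlanState(gatherstate));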
On Tue, Sep 26, 2023 at 10:06 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Mon, Sep 25, 2023 at 9:57 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Wed, Sep 6, 2023 at 11:20 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > - Is there any point to all of these early exit cases? For example, in > > > ExecInitBitmapAnd, why exit early if initialization fails? Why not > > > just plunge ahead and if initialization failed the caller will notice > > > that and when we ExecEndNode some of the child node pointers will be > > > NULL but who cares? The obvious disadvantage of this approach is that > > > we're doing a bunch of unnecessary initialization, but we're also > > > speeding up the common case where we don't need to abort by avoiding a > > > branch that will rarely be taken. I'm not quite sure what the right > > > thing to do is here. > > I thought about this some and figured that adding the > > is-CachedPlan-still-valid tests in the following places should suffice > > after all: > > > > 1. In InitPlan() right after the top-level ExecInitNode() calls > > 2. In ExecInit*() functions of Scan nodes, right after > > ExecOpenScanRelation() calls > > After sleeping on this, I think we do need the checks after all the > ExecInitNode() calls too, because we have many instances of the code > like the following one: > > outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags); > tupDesc = ExecGetResultType(outerPlanState(gatherstate)); > <some code that dereferences outDesc> > > If outerNode is a SeqScan and ExecInitSeqScan() returned early because > ExecOpenScanRelation() detected that plan was invalidated, then > tupDesc would be NULL in this case, causing the code to crash. > > Now one might say that perhaps we should only add the > is-CachedPlan-valid test in the instances where there is an actual > risk of such misbehavior, but that could lead to confusion, now or > later. It seems better to add them after every ExecInitNode() call > while we're inventing the notion, because doing so relieves the > authors of future enhancements of the ExecInit*() routines from > worrying about any of this. > > Attached 0003 should show how that turned out. > > Updated 0002 as mentioned in the previous reply -- setting pointers to > NULL after freeing them more consistently across various ExecEnd*() > routines and using the `if (pointer != NULL)` style over the `if > (pointer)` more consistently. > > Updated 0001's commit message to remove the mention of its relation to > any future commits. I intend to push it tomorrow. Pushed that one. Here are the rebased patches. 0001 seems ready to me, but I'll wait a couple more days for others to weigh in. Just to highlight a kind of change that others may have differing opinions on, consider this hunk from the patch: - MemoryContextDelete(node->aggcontext); + if (node->aggcontext != NULL) + { + MemoryContextDelete(node->aggcontext); + node->aggcontext = NULL; + } ... + ExecEndNode(outerPlanState(node)); + outerPlanState(node) = NULL; So the patch wants to enhance the consistency of setting the pointer to NULL after freeing part. Robert mentioned his preference for doing it in the patch, which I agree with. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
- v48-0007-Delay-locking-of-child-tables-in-cached-plans-un.patch
- v48-0006-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v48-0008-Track-opened-range-table-relations-in-a-List-in-.patch
- v48-0005-Assert-that-relations-needing-their-permissions-.patch
- v48-0004-Teach-the-executor-to-lock-child-tables-in-some-.patch
- v48-0003-Adjustments-to-allow-ExecutorStart-to-sometimes-.patch
- v48-0001-Assorted-tightening-in-various-ExecEnd-routines.patch
- v48-0002-Prepare-executor-to-support-detecting-CachedPlan.patch
On Thu, Sep 28, 2023 at 5:26 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Tue, Sep 26, 2023 at 10:06 PM Amit Langote <amitlangote09@gmail.com> wrote: > > After sleeping on this, I think we do need the checks after all the > > ExecInitNode() calls too, because we have many instances of the code > > like the following one: > > > > outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags); > > tupDesc = ExecGetResultType(outerPlanState(gatherstate)); > > <some code that dereferences outDesc> > > > > If outerNode is a SeqScan and ExecInitSeqScan() returned early because > > ExecOpenScanRelation() detected that plan was invalidated, then > > tupDesc would be NULL in this case, causing the code to crash. > > > > Now one might say that perhaps we should only add the > > is-CachedPlan-valid test in the instances where there is an actual > > risk of such misbehavior, but that could lead to confusion, now or > > later. It seems better to add them after every ExecInitNode() call > > while we're inventing the notion, because doing so relieves the > > authors of future enhancements of the ExecInit*() routines from > > worrying about any of this. > > > > Attached 0003 should show how that turned out. > > > > Updated 0002 as mentioned in the previous reply -- setting pointers to > > NULL after freeing them more consistently across various ExecEnd*() > > routines and using the `if (pointer != NULL)` style over the `if > > (pointer)` more consistently. > > > > Updated 0001's commit message to remove the mention of its relation to > > any future commits. I intend to push it tomorrow. > > Pushed that one. Here are the rebased patches. > > 0001 seems ready to me, but I'll wait a couple more days for others to > weigh in. Just to highlight a kind of change that others may have > differing opinions on, consider this hunk from the patch: > > - MemoryContextDelete(node->aggcontext); > + if (node->aggcontext != NULL) > + { > + MemoryContextDelete(node->aggcontext); > + node->aggcontext = NULL; > + } > ... > + ExecEndNode(outerPlanState(node)); > + outerPlanState(node) = NULL; > > So the patch wants to enhance the consistency of setting the pointer > to NULL after freeing part. Robert mentioned his preference for doing > it in the patch, which I agree with. Rebased. I haven't been able to reproduce and debug a crash reported by cfbot that I see every now and then: https://cirrus-ci.com/task/5673432591892480?logs=cores#L0 [22:46:12.328] Program terminated with signal SIGSEGV, Segmentation fault. [22:46:12.328] Address not mapped to object. [22:46:12.838] #0 afterTriggerInvokeEvents (events=events@entry=0x836db0460, firing_id=1, estate=estate@entry=0x842eec100, delete_ok=<optimized out>) at ../src/backend/commands/trigger.c:4656 [22:46:12.838] #1 0x00000000006c67a8 in AfterTriggerEndQuery (estate=estate@entry=0x842eec100) at ../src/backend/commands/trigger.c:5085 [22:46:12.838] #2 0x000000000065bfba in CopyFrom (cstate=0x836df9038) at ../src/backend/commands/copyfrom.c:1293 ... While a patch in this series does change src/backend/commands/trigger.c, I'm not yet sure about its relation with the backtrace shown there. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
- v49-0006-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v49-0007-Delay-locking-of-child-tables-in-cached-plans-un.patch
- v49-0005-Assert-that-relations-needing-their-permissions-.patch
- v49-0004-Teach-the-executor-to-lock-child-tables-in-some-.patch
- v49-0008-Track-opened-range-table-relations-in-a-List-in-.patch
- v49-0002-Prepare-executor-to-support-detecting-CachedPlan.patch
- v49-0001-Assorted-tightening-in-various-ExecEnd-routines.patch
- v49-0003-Adjustments-to-allow-ExecutorStart-to-sometimes-.patch
Reviewing 0001: Perhaps ExecEndCteScan needs an adjustment. What if node->leader was never set? Other than that, I think this is in good shape. Maybe there are other things we'd want to adjust here, or maybe there aren't, but there doesn't seem to be any good reason to bundle more changes into the same patch. Reviewing 0002 and beyond: I think it's good that you have tried to divide up a big change into little pieces, but I'm finding the result difficult to understand. It doesn't really seem like each patch stands on its own. I keep flipping between patches to try to understand why other patches are doing things, which kind of defeats the purpose of splitting stuff up. For example, 0002 adds a NodeTag field to QueryDesc, but it doesn't even seem to initialize that field, let alone use it for anything. It adds a CachedPlan pointer to QueryDesc too, and adapts CreateQueryDesc to allow one as an argument, but none of the callers actually pass anything. I suspect that the first change (adding a NodeTag field) is a bug, and that the second one is intentional, but it's hard to tell without flipping through all of the other patches to see how they build on what 0002 does. And even when something isn't a bug, it's also hard to tell whether it's the right design, again because you can't consider each patch in isolation. Ideally, splitting a patch set should bring related changes together in a single patch and push unrelated changes apart into different patches, but I don't really see this particular split having that effect. There is a chicken and egg problem here, to be fair. If we add code that can make plan initialization fail without teaching the planner to cope with failures, then we have broken the server, and if we do the reverse, then we have a bunch of dead code that we can't test. Neither is very satisfactory. But I still hope there's some better division possible than what you have here currently. For instance, I wonder if it would be possible to add all the stuff to cope with plan initialization failing and then have a test patch that makes initialization randomly fail with some probability (or maybe you can even cause failures at specific points). Then you could test that infrastructure by running the regression tests in a loop with various values of the relevant setting. Another overall comment that I have is that it doesn't feel like there's enough high-level explanation of the design. I don't know how much of that should go in comments vs. commit messages vs. a README that accompanies the patch set vs. whatever else, and I strongly suspect that some of the stuff that seems confusing now is actually stuff that at one point I understood and have just forgotten about. But rediscovering it shouldn't be quite so hard. For example, consider the question "why are we storing the CachedPlan in the QueryDesc?" I eventually figured out that it's so that ExecPlanStillValid can call CachedPlanStillValid which can then consult the cached plan's is_valid flag. But is that the only access to the CachedPlan that we ever expect to occur via the QueryDesc? If not, what else is allowable? If so, why not just store a Boolean in the QueryDesc and arrange for the plancache to be able to flip it when invalidating? I'm not saying that's a better design -- I'm saying that it looks hard to understand your thought process from the patch set. 
And also, you know, assuming the current design is correct, could there be some way of dividing up the patch set so that this one change, where we add the CachedPlan to the QueryDesc, isn't so spread out across the whole series? Some more detailed review comments below. This isn't really a full review because I don't understand the patches well enough for that, but it's some stuff I noticed. In 0002: + * result-rel info, etc. Also, we don't pass the parent't copy of the Typo. + /* + * All the necessary locks must already have been taken when + * initializing the parent's copy of subplanstate, so the CachedPlan, + * if any, should not have become invalid during ExecInitNode(). + */ + Assert(ExecPlanStillValid(rcestate)); This -- and the other similar instance -- feel very uncomfortable. There's a lot of action at a distance here. If this assertion ever failed, how would anyone ever figure out what went wrong? You wouldn't for example know which object got invalidated, presumably corresponding to a lock that you failed to take. Unless the problem were easily reproducible in a test environment, trying to guess what happened might be pretty awful; imagine seeing this assertion failure in a customer log file and trying to back-track to find the underlying bug. A further problem is that what would actually happen is you *wouldn't* see this in the customer log file, because assertions wouldn't be enabled, so you'd just see queries occasionally returning wrong answers, I guess? Or crashing in some other random part of the code? Which seems even worse. At a minimum I think this should be upgraded to a test-and-elog, and maybe there's some value in trying to think of what should get printed by that elog to facilitate proper debugging, if it happens. In 0003: + * + * OK to ignore the return value; plan can't become invalid, + * because there's no CachedPlan. */ - ExecutorStart(cstate->queryDesc, 0); + (void) ExecutorStart(cstate->queryDesc, 0); This also feels awkward, for similar reasons. Sure, it shouldn't return false, but also, if it did, you'd just blindly continue. Maybe there should be test-and-elog here too. Or maybe this is an indication that we need less action at a distance. Like, if ExecutorStart took the CachedPlan as an argument instead of feeding it through the QueryDesc, then you could document that ExecutorStart returns true if that value is passed as NULL and true or false otherwise. Here, whether ExecutorStart can return true or false depends on the contents of the queryDesc ... which, granted, in this case is just built a line or two before anyway, but if you just passed it to ExecutorStart then you wouldn't need to feed it through the QueryDesc, it seems to me. Even better, maybe there should be ExecutorStart() that continues returning void and ExecutorStartExtended() that takes a cached plan as an additional argument and returns a bool. /* - * Check that ExecutorFinish was called, unless in EXPLAIN-only mode. This - * Assert is needed because ExecutorFinish is new as of 9.1, and callers - * might forget to call it. + * Check that ExecutorFinish was called, unless in EXPLAIN-only mode or if + * execution was canceled. This Assert is needed because ExecutorFinish is + * new as of 9.1, and callers might forget to call it. */ Maybe we could drop the second sentence at this point. In 0005: + * XXX Maybe we should we skip calling ExecCheckPermissions from + * InitPlan in a parallel worker. Why? If the thinking is to save overhead, then perhaps try to assess the overhead. 
If the thinking is that we don't want it to fail spuriously, then we have to weigh that against the (security) risk of succeeding spuriously. + * Returns true if current transaction holds a lock on the given relation of + * mode 'lockmode'. If 'orstronger' is true, a stronger lockmode is also OK. + * ("Stronger" is defined as "numerically higher", which is a bit + * semantically dubious but is OK for the purposes we use this for.) I don't particularly enjoy seeing this comment cut and pasted into some new place. Especially the tongue-in-cheek parenthetical part. Better to refer to the original comment or something instead of cut-and-pasting. Also, why is it appropriate to pass orstronger = true here? Don't we expect the *exact* lock mode that we have planned to be held, and isn't it a sure sign of a bug if it isn't? Maybe orstronger should just be ripped out here (and the comment could then go away too). In 0006: + /* + * RTIs of all partitioned tables whose children are scanned by + * appendplans. The list contains a bitmapset for every partition tree + * covered by this Append. + */ The first sentence of this comment makes this sound like a list of integers, the RTIs of all partitioned tables that are scanned. The second sentence makes it sound like a list of bitmapsets, but what does it mean to talk about each partition tree covered by this Append? This is far from a complete review but I'm running out of steam for today. I hope that it's at least somewhat useful. ...Robert
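One possible reading of the ExecutorStart()/ExecutorStartExtended() suggestion above, spelled out as code. This is entirely hypothetical; no such function exists in core and the signature is only a guess, but it shows how existing callers could keep the void-returning API while plancache-aware callers use the extended form and retry with a fresh plan when it returns false.

    /* returns false iff cplan was invalidated during plan initialization */
    extern bool ExecutorStartExtended(QueryDesc *queryDesc, int eflags,
                                      CachedPlan *cplan);

    void
    ExecutorStart(QueryDesc *queryDesc, int eflags)
    {
        /*
         * With no CachedPlan in play there is nothing that can go stale,
         * so existing callers never need to handle an invalidation.
         */
        if (!ExecutorStartExtended(queryDesc, eflags, NULL))
            elog(ERROR, "plan initialization failed unexpectedly");
    }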
On Mon, 20 Nov 2023 at 10:00, Amit Langote <amitlangote09@gmail.com> wrote: > > On Thu, Sep 28, 2023 at 5:26 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Tue, Sep 26, 2023 at 10:06 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > After sleeping on this, I think we do need the checks after all the > > > ExecInitNode() calls too, because we have many instances of the code > > > like the following one: > > > > > > outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags); > > > tupDesc = ExecGetResultType(outerPlanState(gatherstate)); > > > <some code that dereferences outDesc> > > > > > > If outerNode is a SeqScan and ExecInitSeqScan() returned early because > > > ExecOpenScanRelation() detected that plan was invalidated, then > > > tupDesc would be NULL in this case, causing the code to crash. > > > > > > Now one might say that perhaps we should only add the > > > is-CachedPlan-valid test in the instances where there is an actual > > > risk of such misbehavior, but that could lead to confusion, now or > > > later. It seems better to add them after every ExecInitNode() call > > > while we're inventing the notion, because doing so relieves the > > > authors of future enhancements of the ExecInit*() routines from > > > worrying about any of this. > > > > > > Attached 0003 should show how that turned out. > > > > > > Updated 0002 as mentioned in the previous reply -- setting pointers to > > > NULL after freeing them more consistently across various ExecEnd*() > > > routines and using the `if (pointer != NULL)` style over the `if > > > (pointer)` more consistently. > > > > > > Updated 0001's commit message to remove the mention of its relation to > > > any future commits. I intend to push it tomorrow. > > > > Pushed that one. Here are the rebased patches. > > > > 0001 seems ready to me, but I'll wait a couple more days for others to > > weigh in. Just to highlight a kind of change that others may have > > differing opinions on, consider this hunk from the patch: > > > > - MemoryContextDelete(node->aggcontext); > > + if (node->aggcontext != NULL) > > + { > > + MemoryContextDelete(node->aggcontext); > > + node->aggcontext = NULL; > > + } > > ... > > + ExecEndNode(outerPlanState(node)); > > + outerPlanState(node) = NULL; > > > > So the patch wants to enhance the consistency of setting the pointer > > to NULL after freeing part. Robert mentioned his preference for doing > > it in the patch, which I agree with. > > Rebased. There is a leak reported at [1]; details for the same are available at [2]:

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/select_views.out /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/select_views.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/select_views.out 2023-12-19 23:00:04.677385000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/select_views.out 2023-12-19 23:06:26.870259000 +0000
@@ -1288,6 +1288,7 @@
 (102, '2011-10-12', 120),
 (102, '2011-10-28', 200),
 (103, '2011-10-15', 480);
+WARNING: resource was not closed: relation "customer_pkey"
 CREATE VIEW my_property_normal AS SELECT * FROM customer WHERE name = current_user;
 CREATE VIEW my_property_secure WITH (security_barrier) A

[1] - https://cirrus-ci.com/task/6494009196019712
[2] - https://api.cirrus-ci.com/v1/artifact/task/6494009196019712/testrun/build/testrun/regress-running/regress/regression.diffs

Regards, Vignesh
> On 6 Dec 2023, at 23:52, Robert Haas <robertmhaas@gmail.com> wrote: > > I hope that it's at least somewhat useful. > > On 5 Jan 2024, at 15:46, vignesh C <vignesh21@gmail.com> wrote: > > There is a leak reported Hi Amit, this is a kind reminder that some feedback on your patch[0] is waiting for your reply. Thank you for your work! Best regards, Andrey Borodin. [0] https://commitfest.postgresql.org/47/3478/
Hi Andrey, On Sun, Mar 31, 2024 at 2:03 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > On 6 Dec 2023, at 23:52, Robert Haas <robertmhaas@gmail.com> wrote: > > > > I hope that it's at least somewhat useful. > > > On 5 Jan 2024, at 15:46, vignesh C <vignesh21@gmail.com> wrote: > > > > There is a leak reported > > Hi Amit, > > this is a kind reminder that some feedback on your patch[0] is waiting for your reply. > Thank you for your work! Thanks for moving this to the next CF. My apologies (especially to Robert) for not replying on this thread for a long time. I plan to start working on this soon. -- Thanks, Amit Langote
On Fri, 20 Jan 2023 at 08:39, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I spent some time re-reading this whole thread, and the more I read > the less happy I got. We are adding a lot of complexity and introducing > coding hazards that will surely bite somebody someday. And after awhile > I had what felt like an epiphany: the whole problem arises because the > system is wrongly factored. We should get rid of AcquireExecutorLocks > altogether, allowing the plancache to hand back a generic plan that > it's not certain of the validity of, and instead integrate the > responsibility for acquiring locks into executor startup. It'd have > to be optional there, since we don't need new locks in the case of > executing a just-planned plan; but we can easily add another eflags > bit (EXEC_FLAG_GET_LOCKS or so). Then there has to be a convention > whereby the ExecInitNode traversal can return an indicator that > "we failed because the plan is stale, please make a new plan". I also reread the entire thread up to this point yesterday. I've also been thinking about this recently as Amit has mentioned it to me a few times over the past few months. With the caveat of not yet having looked at the latest patch, my thoughts are that having the executor startup responsible for taking locks is a bad idea and I don't think we should go down this path. My reasons are: 1. No ability to control the order that the locks are obtained. The order in which the locks are taken will be at the mercy of the plan the planner chooses. 2. It introduces lots of complexity regarding how to cleanly clean up after a failed executor startup which is likely to make exec startup slower and the code more complex 3. It puts us even further down the path of actually needing an executor startup phase. For #1, the locks taken for SELECT queries are less likely to conflict with other locks obtained by PostgreSQL, but at least at the moment if someone is getting deadlocks with a DDL type operation, they can change their query or DDL script so that locks are taken in the same order. If we allowed executor startup to do this then if someone comes complaining that PG18 deadlocks when PG17 didn't we'd just have to tell them to live with it. There's a comment at the bottom of find_inheritance_children_extended() just above the qsort() which explains about the deadlocking issue. I don't have much extra to say about #2. As mentioned, I've not looked at the patch. On paper, it sounds possible, but it also sounds bug-prone and ugly. For #3, I've been thinking about what improvements we can do to make the executor more efficient. In [1], Andres talks about some very interesting things. In particular, in his email items 3) and 5) are relevant here. If we did move lots of executor startup code into the planner, I think it would be possible to one day get rid of executor startup and have the plan record how much memory is needed for the non-readonly part of the executor state and tag each plan node with the offset in bytes they should use for their portion of the executor working state. This would be a single memory allocation for the entire plan. The exact details are not important here, but I feel like if we load up executor startup with more responsibilities, it'll just make doing something like this harder. The init run-time pruning code that I worked on likely already has done that, but I don't think it's closed the door on it as it might just mean allocating more executor state memory than we need to. 
Providing the plan node records the offset into that memory, I think it could be made to work, just with the inefficiency of having a (possibly) large unused hole in that state memory. As far as I understand it, your objection to the original proposal is just on the grounds of concerns about introducing hazards that could turn into bugs. I think we could come up with some way to make the prior method of doing pruning before executor startup work. I think what Amit had before your objection was starting to turn into something workable and we should switch back to working on that. David [1] https://www.postgresql.org/message-id/20180525033538.6ypfwcqcxce6zkjj%40alap3.anarazel.de
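To make the single-allocation idea a bit more concrete, a deliberately rough sketch follows. None of these structures, fields, or functions exist; this only illustrates the direction being argued about, where the planner records a total size and per-node offsets so executor startup becomes one allocation plus pointer arithmetic.

    /* hypothetical planner outputs */
    typedef struct PlanStateLayout
    {
        Size        total_size;     /* bytes of writable executor state */
    } PlanStateLayout;

    typedef struct PlanNodeLayout
    {
        Size        state_offset;   /* this node's offset into the block */
    } PlanNodeLayout;

    /* one palloc0 for the whole plan, done once at executor startup */
    static char *
    allocate_executor_state(const PlanStateLayout *layout)
    {
        return palloc0(layout->total_size);
    }

    /* each node's init step then just addresses its slice of the block */
    static void *
    node_working_state(char *state_block, const PlanNodeLayout *node_layout)
    {
        return state_block + node_layout->state_offset;
    }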
David Rowley <dgrowleyml@gmail.com> writes: > With the caveat of not yet having looked at the latest patch, my > thoughts are that having the executor startup responsible for taking > locks is a bad idea and I don't think we should go down this path. OK, it's certainly still up for argument, but ... > 1. No ability to control the order that the locks are obtained. The > order in which the locks are taken will be at the mercy of the plan > the planner chooses. I do not think I buy this argument, because plancache.c doesn't provide any "ability to control the order" today, and never has. The order in which AcquireExecutorLocks re-gets relation locks is only weakly related to the order in which the parser/planner got them originally. The order in which AcquirePlannerLocks re-gets the locks is even less related to the original. This doesn't cause any big problems that I'm aware of, because these locks are fairly weak. I think we do have a guarantee that for partitioned tables, parents will be locked before children, and that's probably valuable. But an executor-driven lock order could preserve that property too. > 2. It introduces lots of complexity regarding how to cleanly clean up > after a failed executor startup which is likely to make exec startup > slower and the code more complex Perhaps true, I'm not sure. But the patch we'd been discussing before this proposal was darn complex as well. > 3. It puts us even further down the path of actually needing an > executor startup phase. Huh? We have such a thing already. > For #1, the locks taken for SELECT queries are less likely to conflict > with other locks obtained by PostgreSQL, but at least at the moment if > someone is getting deadlocks with a DDL type operation, they can > change their query or DDL script so that locks are taken in the same > order. If we allowed executor startup to do this then if someone > comes complaining that PG18 deadlocks when PG17 didn't we'd just have > to tell them to live with it. There's a comment at the bottom of > find_inheritance_children_extended() just above the qsort() which > explains about the deadlocking issue. The reason it's important there is that function is (sometimes) used for lock modes that *are* exclusive. > For #3, I've been thinking about what improvements we can do to make > the executor more efficient. In [1], Andres talks about some very > interesting things. In particular, in his email items 3) and 5) are > relevant here. If we did move lots of executor startup code into the > planner, I think it would be possible to one day get rid of executor > startup and have the plan record how much memory is needed for the > non-readonly part of the executor state and tag each plan node with > the offset in bytes they should use for their portion of the executor > working state. I'm fairly skeptical about that idea. The entire reason we have an issue here is that we want to do runtime partition pruning, which by definition can't be done at plan time. So I doubt it's going to play nice with what we are trying to accomplish in this thread. Moreover, while "replace a bunch of small pallocs with one big one" would save some palloc effort, what are you going to do to ensure that that memory has the right initial contents? I think this idea is likely to make the executor a great deal more notationally complex without actually buying all that much. Maybe Andres can make it work, but I don't want to contort other parts of the system design on the purely hypothetical basis that this might happen. 
> I think what Amit had before your objection was starting to turn into > something workable and we should switch back to working on that. The reason I posted this idea was that I didn't think the previously existing patch looked promising at all. regards, tom lane
On Sun, 19 May 2024 at 13:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > David Rowley <dgrowleyml@gmail.com> writes: > > 1. No ability to control the order that the locks are obtained. The > > order in which the locks are taken will be at the mercy of the plan > > the planner chooses. > > I do not think I buy this argument, because plancache.c doesn't > provide any "ability to control the order" today, and never has. > The order in which AcquireExecutorLocks re-gets relation locks is only > weakly related to the order in which the parser/planner got them > originally. The order in which AcquirePlannerLocks re-gets the locks > is even less related to the original. This doesn't cause any big > problems that I'm aware of, because these locks are fairly weak. It may not bite many people, it's just that if it does, I don't see what we could do to help those people. At the moment we could tell them to adjust their DDL script to obtain the locks in the same order as their query. With your idea that cannot be done as the order could change when the planner switches the join order. > I think we do have a guarantee that for partitioned tables, parents > will be locked before children, and that's probably valuable. > But an executor-driven lock order could preserve that property too. I think you'd have to lock the parent before the child. That would remain true and consistent anyway when taking locks during a breadth-first plan traversal. > > For #3, I've been thinking about what improvements we can do to make > > the executor more efficient. In [1], Andres talks about some very > > interesting things. In particular, in his email items 3) and 5) are > > relevant here. If we did move lots of executor startup code into the > > planner, I think it would be possible to one day get rid of executor > > startup and have the plan record how much memory is needed for the > > non-readonly part of the executor state and tag each plan node with > > the offset in bytes they should use for their portion of the executor > > working state. > > I'm fairly skeptical about that idea. The entire reason we have an > issue here is that we want to do runtime partition pruning, which > by definition can't be done at plan time. So I doubt it's going > to play nice with what we are trying to accomplish in this thread. I think we could have both, providing there was a way to still traverse the executor state tree in EXPLAIN. We'd need a way to skip portions of the plan that are not relevant or could be invalid for the current execution. e.g can't show Index Scan because index has been dropped. > > I think what Amit had before your objection was starting to turn into > > something workable and we should switch back to working on that. > > The reason I posted this idea was that I didn't think the previously > existing patch looked promising at all. Ok. It would be good if you could expand on that so we could determine if there's some fundamental reason it can't work or if that's because you were blinded by your epiphany and didn't give that any thought after thinking of the alternative idea. I've gone to effort to point out things that I think are concerning with your idea. It would be good if you could do the same for the previous patch other than "it didn't look promising". It's pretty hard for me to argue with that level of detail. David
On Sun, May 19, 2024 at 9:39 AM David Rowley <dgrowleyml@gmail.com> wrote: > For #1, the locks taken for SELECT queries are less likely to conflict > with other locks obtained by PostgreSQL, but at least at the moment if > someone is getting deadlocks with a DDL type operation, they can > change their query or DDL script so that locks are taken in the same > order. If we allowed executor startup to do this then if someone > comes complaining that PG18 deadlocks when PG17 didn't we'd just have > to tell them to live with it. There's a comment at the bottom of > find_inheritance_children_extended() just above the qsort() which > explains about the deadlocking issue. Thought to chime in on this. A deadlock may occur with the execution-time locking proposed in the patch if the DDL script makes assumptions about how a cached plan's execution determines the locking order for children of multiple parent relations. Specifically, the deadlock can happen if the script tries to lock the child relations directly, instead of locking them through their respective parent relations. The patch doesn't change the order of locking of relations mentioned in the query, because that's defined in AcquirePlannerLocks(). -- Thanks, Amit Langote
I had occasion to run the same benchmark you described in the initial email in this thread. To do so I applied patch series v49 on top of 07cb29737a4e, which is simply a commit that happened to have the same date as v49. I then used a script like this (against a server having plan_cache_mode=force_generic_plan):

for numparts in 0 1 2 4 8 16 32 48 64 80 81 96 127 128 160 200 256 257 288 300 384 512 1024 1536 2048; do
  pgbench testdb -i --partitions=$numparts 2>/dev/null
  echo -ne "$numparts\t"
  pgbench -n testdb -S -T30 -Mprepared | grep "^tps" | sed -e 's/^tps = \([0-9.]*\) .*/\1/'
done

and did the same with the commit mentioned above (that is, unpatched). I got this table as a result:

 partitions │      patched │   07cb29737a
────────────┼──────────────┼──────────────
          0 │ 65632.090431 │ 68967.712741
          1 │ 68096.641831 │ 65356.587223
          2 │ 59456.507575 │ 60884.679464
          4 │    62097.426 │ 59698.747104
          8 │ 58044.311175 │ 57817.104562
         16 │ 59741.926563 │ 52549.916262
         32 │ 59261.693449 │ 44815.317215
         48 │ 59047.125629 │ 38362.123652
         64 │ 59748.738797 │ 34051.158525
         80 │ 59276.839183 │ 32026.135076
         81 │ 62318.572932 │ 30418.122933
         96 │ 59678.857163 │ 28478.113651
        127 │ 58761.960028 │ 24272.303742
        128 │ 59934.268306 │ 24275.214593
        160 │ 56688.790899 │ 21119.043564
        200 │ 56323.188599 │ 18111.212849
        256 │  55915.22466 │ 14753.953709
        257 │ 57810.530461 │ 15093.497575
        288 │ 56874.780092 │ 13873.332162
        300 │ 57222.056549 │ 13463.768946
        384 │  54073.77295 │ 11183.558339
        512 │ 37503.766847 │   8114.32532
       1024 │ 42746.866448 │   4468.41359
       1536 │  39500.58411 │  3049.984599
       2048 │ 36988.519486 │  2269.362006

where already at 16 partitions we can see that things are going downhill with the unpatched code. (However, what happens when the table is not partitioned looks a bit funny.) I hope we can get this new executor code in 18. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "The first law of live demos is: don't try to use the system. Write a script that doesn't touch anything, so as not to cause damage." (Jakob Nielsen)
On Thu, Jun 20, 2024 at 2:09 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I hope we can get this new executor code in 18. Thanks for doing the benchmark, Alvaro, and sorry for the late reply. Yes, I'm hoping to get *some* version of this into v18. I've been thinking how to move this forward and I'm starting to think that we should go back to or at least consider as an option the old approach of changing the plancache to do the initial runtime pruning instead of changing the executor to take locks, which is the design that the latest patch set tries to implement. Here are the challenges facing the implementation of the current design: 1. I went through many iterations of the changes to ExecInitNode() to return a partially initialized PlanState tree when it detects that the CachedPlan was invalidated after locking a child table and to ExecEndNode() to account for the PlanState tree sometimes being partially initialized, but it still seems fragile and bug-prone to me. It might be because this approach is fundamentally hard to get right or I haven't invested enough effort in becoming more confident in its robustness. 2. Refactoring needed due to the ExecutorStart() API change especially that pertaining to portals does not seem airtight. I'm especially worried about moving the ExecutorStart() call for the PORTAL_MULTI_QUERY case from where it is currently to PortalStart(). That requires additional bookkeeping in PortalData and I am not totally sure that the snapshot handling changes after that move are entirely correct. 3. The need to add *back* the fields to store the RT indexes of relations that are not looked at by ExecInitNode() traversal such as root partitioned tables and non-leaf partitions. I'm worried about #2 the most. One complaint about the previous design was that the interface changes to capture and pass the result of doing initial pruning in plancache.c to the executor did not look great. However, after having tried doing #2, the changes to pass the pruning result into the executor and changes to reuse it in ExecInit[Merge]Append() seem a tad bit simpler than the refactoring and adjustments needed to handle failed ExecutorStart() calls, at multiple code sites. About #1, I tend to agree with David that adding complexity around PlanState tree construction may not be a good idea, because we might want to rethink Plan initialization code and data structures in the not too distant future. One idea I thought of is to take the remaining locks (to wit, those on inheritance children if running a cached plan) at the beginning of InitPlan(), that is before ExecInitNode(), like we handle the permission checking, so that we don't need to worry about ever returning a partially initialized PlanState tree. However, we're still left with the tall task to implement #2 such that it doesn't break anything. Another concern about the old design was the unnecessary overhead of initializing bitmapset fields in PlannedStmt that are meant for the locking algorithm in AcquireExecutorLocks(). Andres suggested an idea offlist to either piggyback on cursorOptions argument of pg_plan_queries() or adding a new boolean parameter to let the planner know if the plan is one that might get cached and thus have AcquireExecutorLocks() called on it. Another idea David and I discussed offlist is inventing a RTELockInfo (cf RTEPermissionInfo) and only creating one for each RT entry that is un-prunable and do away with PlannedStmt.rtable. 
For partitioned tables, that entry will point to the PartitionPruneInfo that will contain the RT indexes of partitions (or maybe just OIDs) mapped from their subplan indexes that are returned by the pruning code. So AcquireExecutorLocks() will lock all un-prunable relations by referring to their RTELockInfo entries and for each entry that points to a PartitionPruneInfo with initial pruning steps, will only lock the partitions that survive the pruning. I am planning to polish that old patch set and post after playing with those new ideas. -- Thanks, Amit Langote
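A pseudo-C sketch of the locking scheme described in the last two paragraphs, to make it easier to picture. RTELockInfo and every field used on it are hypothetical, and PerformInitialPruning() is a stand-in for running only the "initial" pruning steps; LockRelationOid() and the Bitmapset API are the real ones.

    static void
    AcquireExecutorLocksPruned(PlannedStmt *stmt)
    {
        ListCell   *lc;

        foreach(lc, stmt->rteLockInfos)     /* hypothetical replacement for
                                             * walking the full rtable */
        {
            RTELockInfo *lockinfo = (RTELockInfo *) lfirst(lc);

            /* un-prunable relations are always locked */
            LockRelationOid(lockinfo->relid, lockinfo->lockmode);

            if (lockinfo->pruneinfo != NULL)
            {
                /* run only the initial pruning steps, then lock survivors */
                Bitmapset  *surviving = PerformInitialPruning(lockinfo->pruneinfo);
                int         i = -1;

                while ((i = bms_next_member(surviving, i)) >= 0)
                    LockRelationOid(lockinfo->part_oids[i], lockinfo->lockmode);
            }
        }
    }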
On Mon, Aug 12, 2024 at 8:54 AM Amit Langote <amitlangote09@gmail.com> wrote: > 1. I went through many iterations of the changes to ExecInitNode() to > return a partially initialized PlanState tree when it detects that the > CachedPlan was invalidated after locking a child table and to > ExecEndNode() to account for the PlanState tree sometimes being > partially initialized, but it still seems fragile and bug-prone to me. > It might be because this approach is fundamentally hard to get right > or I haven't invested enough effort in becoming more confident in its > robustness. Can you give some examples of what's going wrong, or what you think might go wrong? I didn't think there was a huge problem here based on previous discussion, but I could very well be missing some important challenge. > 2. Refactoring needed due to the ExecutorStart() API change especially > that pertaining to portals does not seem airtight. I'm especially > worried about moving the ExecutorStart() call for the > PORTAL_MULTI_QUERY case from where it is currently to PortalStart(). > That requires additional bookkeeping in PortalData and I am not > totally sure that the snapshot handling changes after that move are > entirely correct. Here again, it would help to see exactly what you had to do and what consequences you think it might have. But it sounds like you're talking about moving ExecutorStart() from PortalStart() to PortalRun() and I agree that sounds like it might have user-visible behavioral consequences that we don't want. > 3. The need to add *back* the fields to store the RT indexes of > relations that are not looked at by ExecInitNode() traversal such as > root partitioned tables and non-leaf partitions. I don't remember exactly why we removed those or what the benefit was, so I'm not sure how big of a problem it is if we have to put them back. > About #1, I tend to agree with David that adding complexity around > PlanState tree construction may not be a good idea, because we might > want to rethink Plan initialization code and data structures in the > not too distant future. Like Tom, I don't really buy this. There might be a good reason not to do this in ExecutorStart(), but the hypothetical possibility that we might want to change something and that this patch might make it harder is not it. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Aug 15, 2024 at 8:57 AM Amit Langote <amitlangote09@gmail.com> wrote: > TBH, it's more of a hunch that people who are not involved in this > development might find the new reality, whereby the execution is not > racefree until ExecutorRun(), hard to reason about. I'm confused by what you mean here by "racefree". A race means multiple sessions are doing stuff at the same time and the result depends on who does what first, but the executor stuff is all backend-private. Heavyweight locks are not backend-private, but those would be taken in ExectorStart(), not ExecutorRun(), IIUC. > With the patch, CreateQueryDesc() and ExecutorStart() are moved to > PortalStart() so that QueryDescs including the PlanState trees for all > queries are built before any is run. Why? So that if ExecutorStart() > fails for any query in the list, we can simply throw out the QueryDesc > and the PlanState trees of the previous queries (NOT run them) and ask > plancache for a new CachedPlan for the list of queries. We don't have > a way to ask plancache.c to replan only a given query in the list. I agree that moving this from PortalRun() to PortalStart() seems like a bad idea, especially in view of what you write below. > * There's no longer CCI() between queries in PortalRunMulti() because > the snapshots in each query's QueryDesc must have been adjusted to > reflect the correct command counter. I've checked but can't really be > sure if the value in the snapshot is all anyone ever uses if they want > to know the current value of the command counter. I don't think anything stops somebody wanting to look at the current value of the command counter. I also don't think you can remove the CommandCounterIncrement() calls between successive queries, because then they won't see the effects of earlier calls. So this sounds broken to me. Also keep in mind that one of the queries could call a function which does something that bumps the command counter again. I'm not sure if that creates its own hazzard separate from the lack of CCIs, or whether it's just another part of that same issue. But you can't assume that each query's snapshot should have a command counter value one more than the previous query. While this all seems bad for the partially-initialized-execution-tree approach, I wonder if you don't have problems here with the other design, too. Let's say you've the multi-query case and there are 2 queries. The first one (Q1) is SELECT mysterious_function() and the second one (Q2) is SELECT * FROM range_partitioned_table WHERE key_column = 42. What if mysterious_function() performs DDL on range_partitioned_table? I haven't tested this so maybe there are things going on here that prevent trouble, but it seems like executing Q1 can easily invalidate the plan for Q2. And then it seems like you're basically back to the same problem. > > > 3. The need to add *back* the fields to store the RT indexes of > > > relations that are not looked at by ExecInitNode() traversal such as > > > root partitioned tables and non-leaf partitions. > > > > I don't remember exactly why we removed those or what the benefit was, > > so I'm not sure how big of a problem it is if we have to put them > > back. > > We removed those in commit 52ed730d511b after commit f2343653f5b2 > removed redundant execution-time locking of non-leaf relations. 
> So we > removed them because we realized that execution time locking is > unnecessary given that AcquireExecutorLocks() exists and now we want > to add them back because we'd like to get rid of > AcquireExecutorLocks(). :-) My bias is to believe that getting rid of AcquireExecutorLocks() is probably the right thing to do, but that's not a strongly-held position and I could be totally wrong about it. The thing is, though, that AcquireExecutorLocks() is fundamentally stupid, and it's hard to see how it can ever be any smarter. If we want to make smarter decisions about what to lock, it seems reasonable to me to think that the locking code needs to be closer to code that can evaluate expressions and prune partitions and stuff like that. -- Robert Haas EDB: http://www.enterprisedb.com
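A minimal SQL sketch of the CCI point above: in a multi-query portal produced by an ON INSERT rule, the rule's query can only see the row added by the original query if the command counter is advanced between the two. All names here are invented for illustration.

create table src (a int);
create table audit (a int);
create rule src_audit as on insert to src
    do also insert into audit select a from src where a = new.a;

insert into src values (1);
select count(*) from audit;   -- expect 1, because the CCI between the original
                              -- insert and the rule action makes the new src
                              -- row visible to the action's SELECT

Removing the CCI between the two queries would presumably leave the rule action reading src with the original command counter, so it would not see the just-inserted row.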
On Fri, Aug 16, 2024 at 12:35 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Aug 15, 2024 at 8:57 AM Amit Langote <amitlangote09@gmail.com> wrote: > > TBH, it's more of a hunch that people who are not involved in this > > development might find the new reality, whereby the execution is not > > racefree until ExecutorRun(), hard to reason about. > > I'm confused by what you mean here by "racefree". A race means > multiple sessions are doing stuff at the same time and the result > depends on who does what first, but the executor stuff is all > backend-private. Heavyweight locks are not backend-private, but those > would be taken in ExectorStart(), not ExecutorRun(), IIUC. Sorry, yes, I meant ExecutorStart(). A backend that wants to execute a plan tree from a CachedPlan is in a race with other backends that might modify tables before ExecutorStart() takes the remaining locks. That race window is bigger when it is ExecutorStart() that will take the locks, and I don't mean in terms of timing, but in terms of the other code that can run in between GetCachedPlan() returning a partially valid plan and ExecutorStart() taking the remaining locks, depending on the calling module. > > With the patch, CreateQueryDesc() and ExecutorStart() are moved to > > PortalStart() so that QueryDescs including the PlanState trees for all > > queries are built before any is run. Why? So that if ExecutorStart() > > fails for any query in the list, we can simply throw out the QueryDesc > > and the PlanState trees of the previous queries (NOT run them) and ask > > plancache for a new CachedPlan for the list of queries. We don't have > > a way to ask plancache.c to replan only a given query in the list. > > I agree that moving this from PortalRun() to PortalStart() seems like > a bad idea, especially in view of what you write below. > > > * There's no longer CCI() between queries in PortalRunMulti() because > > the snapshots in each query's QueryDesc must have been adjusted to > > reflect the correct command counter. I've checked but can't really be > > sure if the value in the snapshot is all anyone ever uses if they want > > to know the current value of the command counter. > > I don't think anything stops somebody wanting to look at the current > value of the command counter. I also don't think you can remove the > CommandCounterIncrement() calls between successive queries, because > then they won't see the effects of earlier calls. So this sounds > broken to me. I suppose you mean CCI between "running" (calling ExecutorRun on) successive queries. Then the patch is indeed broken. If we're to make that right, the number of CCIs for the multi-query portals will have to double given the separation of ExecutorStart() and ExecutorRun() phases. > Also keep in mind that one of the queries could call a function which > does something that bumps the command counter again. I'm not sure if > that creates its own hazzard separate from the lack of CCIs, or > whether it's just another part of that same issue. But you can't > assume that each query's snapshot should have a command counter value > one more than the previous query. > > While this all seems bad for the partially-initialized-execution-tree > approach, I wonder if you don't have problems here with the other > design, too. Let's say you've the multi-query case and there are 2 > queries. The first one (Q1) is SELECT mysterious_function() and the > second one (Q2) is SELECT * FROM range_partitioned_table WHERE > key_column = 42. 
What if mysterious_function() performs DDL on > range_partitioned_table? I haven't tested this so maybe there are > things going on here that prevent trouble, but it seems like executing > Q1 can easily invalidate the plan for Q2. And then it seems like > you're basically back to the same problem. A rule (but not views AFAICS) can lead to the multi-query case (there might be other ways). I tried the following, and, yes, the plan for the query queued by the rule is broken by the execution of that for the 1st query: create table foo (a int); create table bar (a int); create or replace function foo_trig_func () returns trigger as $$ begin drop table bar cascade; return new.*; end; $$ language plpgsql; create trigger foo_trig before insert on foo execute function foo_trig_func(); create rule insert_foo AS ON insert TO foo do also insert into bar values (new.*); set plan_cache_mode to force_generic_plan ; prepare q as insert into foo values (1); execute q; NOTICE: drop cascades to rule insert_foo on table foo ERROR: relation with OID 16418 does not exist The ERROR comes from trying to run (actually "initialize") the cached plan for `insert into bar values (new.*);` which is due to the rule. Though, it doesn't have to be a cached plan for the breakage to happen. You can see the same error without the prepared statement: insert into foo values (1); NOTICE: drop cascades to rule insert_foo on table foo ERROR: relation with OID 16418 does not exist Another example: create or replace function foo_trig_func () returns trigger as $$ begin alter table bar add b int; return new.*; end; $$ language plpgsql; execute q; ERROR: table row type and query-specified row type do not match DETAIL: Query has too few columns. insert into foo values (1); ERROR: table row type and query-specified row type do not match DETAIL: Query has too few columns. This time the error occurs in ExecModifyTable(), so when "running" the plan, but again the code that's throwing the error is just "lazy" initialization of the ProjectionInfo when inserting into bar. So it is possible for the executor to try to run a plan that has become invalid since it was created, so... > > > > 3. The need to add *back* the fields to store the RT indexes of > > > > relations that are not looked at by ExecInitNode() traversal such as > > > > root partitioned tables and non-leaf partitions. > > > > > > I don't remember exactly why we removed those or what the benefit was, > > > so I'm not sure how big of a problem it is if we have to put them > > > back. > > > > We removed those in commit 52ed730d511b after commit f2343653f5b2 > > removed redundant execution-time locking of non-leaf relations. So we > > removed them because we realized that execution time locking is > > unnecessary given that AcquireExecutorLocks() exists and now we want > > to add them back because we'd like to get rid of > > AcquireExecutorLocks(). :-) > > My bias is to believe that getting rid of AcquireExecutorLocks() is > probably the right thing to do, but that's not a strongly-held > position and I could be totally wrong about it. The thing is, though, > that AcquireExecutorLocks() is fundamentally stupid, and it's hard to > see how it can ever be any smarter. If we want to make smarter > decisions about what to lock, it seems reasonable to me to think that > the locking code needs to be closer to code that can evaluate > expressions and prune partitions and stuff like that. 
One perhaps crazy idea [1]: What if we remove AcquireExecutorLocks() and move the responsibility of taking the remaining necessary locks into the executor (those on any inheritance children that are added during planning and thus not accounted for by AcquirePlannerLocks()), like the patch already does, but don't make it also check if the plan has become invalid, which it can't do anyway unless it's from a CachedPlan. That means we instead let the executor throw any errors that occur when trying to either initialize the plan because of the changes that have occurred to the objects referenced in the plan, like what is happening in the above example. If that case is going to be rare anyway, why spend energy on checking the validity and replan, especially if that's not an easy thing to do as we're finding out. In the above example, we could say that it's a user error to create a rule like that, so it should not happen in practice, but when it does, the executor seems to deal with it correctly by refusing to execute a broken plan. Perhaps it's more worthwhile to make the executor behave correctly in the face of plan invalidation than teach the rest of the system to deal with the executor throwing its hands up when it runs into an invalid plan? Again, I think this may be a crazy line of thinking but just wanted to get it out there. -- Thanks, Amit Langote [1] I recall Michael Paquier mentioning something like this to me once when I was describing this patch and thread to him.
On Fri, Aug 16, 2024 at 8:36 AM Amit Langote <amitlangote09@gmail.com> wrote: > So it is possible for the executor to try to run a plan that has > become invalid since it was created, so... I'm not sure what the "so what" here is. > One perhaps crazy idea [1]: > > What if we remove AcquireExecutorLocks() and move the responsibility > of taking the remaining necessary locks into the executor (those on > any inheritance children that are added during planning and thus not > accounted for by AcquirePlannerLocks()), like the patch already does, > but don't make it also check if the plan has become invalid, which it > can't do anyway unless it's from a CachedPlan. That means we instead > let the executor throw any errors that occur when trying to either > initialize the plan because of the changes that have occurred to the > objects referenced in the plan, like what is happening in the above > example. If that case is going to be rare anway, why spend energy on > checking the validity and replan, especially if that's not an easy > thing to do as we're finding out. In the above example, we could say > that it's a user error to create a rule like that, so it should not > happen in practice, but when it does, the executor seems to deal with > it correctly by refusing to execute a broken plan . Perhaps it's more > worthwhile to make the executor behave correctly in face of plan > invalidation than teach the rest of the system to deal with the > executor throwing its hands up when it runs into an invalid plan? > Again, I think this may be a crazy line of thinking but just wanted to > get it out there. I don't know whether this is crazy or not. I think there are two issues. One, the set of checks that we have right now might not be complete, and we might just not have realized that because it happens infrequently enough that we haven't found all the bugs. If that's so, then a change like this could be a good thing, because it might force us to fix stuff we should be fixing anyway. I have a feeling that some of the checks you hit there were added as bug fixes long after the code was written originally, so my confidence that we don't have more bugs isn't especially high. And two, it matters a lot how frequent the errors will be in practice. I think we normally try to replan rather than let a stale plan be used because we want to not fail, because users don't like failure. If the design you propose here would make failures more (or less) frequent, then that's a problem (or awesome). -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Aug 16, 2024 at 8:36 AM Amit Langote <amitlangote09@gmail.com> wrote: >> So it is possible for the executor to try to run a plan that has >> become invalid since it was created, so... > I'm not sure what the "so what" here is. The fact that there are holes in our protections against that doesn't make it a good idea to walk away from the protections. That path leads to crashes and data corruption and unhappy users. What the examples here are showing is that AcquireExecutorLocks is incomplete because it only provides defenses against DDL initiated by other sessions, not by our own session. We have CheckTableNotInUse but I'm not sure if it could be applied here. We certainly aren't calling that in anywhere near as systematic a way as we have for acquiring locks. Maybe we should rethink the principle that a session's locks never conflict against itself, although I fear that might be a nasty can of worms. Could it work to do CheckTableNotInUse when acquiring an exclusive table lock? I don't doubt that we'd have to fix some code paths, but if the damage isn't extensive then that might offer a more nearly bulletproof approach. regards, tom lane
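For reference, a minimal sketch of the intra-session protection that CheckTableNotInUse currently provides (names invented); the earlier foo/bar example presumably slips past it because bar is not yet open when the trigger drops it:

create table t (a int);
begin;
declare c cursor for select * from t;   -- keeps t open in this session
alter table t add column b int;
-- expected to fail with something like:
-- ERROR:  cannot ALTER TABLE "t" because it is being used by active queries in this session
-- (a DROP TABLE t here is refused with a similar message)
rollback;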
On Mon, Aug 19, 2024 at 12:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > What the examples here are showing is that AcquireExecutorLocks > is incomplete because it only provides defenses against DDL > initiated by other sessions, not by our own session. We have > CheckTableNotInUse but I'm not sure if it could be applied here. > We certainly aren't calling that in anywhere near as systematic > a way as we have for acquiring locks. > > Maybe we should rethink the principle that a session's locks > never conflict against itself, although I fear that might be > a nasty can of worms. It might not be that bad. It could replace the CheckTableNotInUse() protections that we have today but maybe cover more cases, and it could do so without needing any changes to the shared lock manager. Say every time you start a query you give that query an ID number, and all locks taken by that query are tagged with that ID number in the local lock table, and maybe some flags indicating why the lock was taken. When a new lock acquisition comes along you can say "oh, this lock was previously taken so that we could do thus-and-so" and then use that to fail with the appropriate error message. That seems like it might be more powerful than the refcnt check within CheckTableNotInUse(). But that seems somewhat incidental to what this thread is about. IIUC, Amit's original design involved having the plan cache call some new executor function to do partition pruning before lock acquisition, and then passing that data structure around, including back to the executor, so that we didn't repeat the pruning we already did, which would be a bad thing to do not only because it would incur CPU cost but also because really bad things would happen if we got a different answer the second time. IIUC, you didn't think that was going to work out nicely, and suggested instead moving the pruning+locking to ExecutorStart() time. But now Amit is finding problems with that approach, because by the time we reach PortalRun() for the PORTAL_MULTI_QUERY case, it's too late to replan, because we can't ask the plancache to replan just one query from the list; and if we try to fix that by moving ExecutorStart() to PortalStart(), then there are other problems. Do you have a view on what the way forward might be? This thread has gotten a tad depressing, honestly. All of the opinions about what we ought to do seem to be based on the firm conviction that X or Y or Z will not work, rather than on the confidence that A or B or C will work. Yet I'm inclined to believe this problem is solvable. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > But that seems somewhat incidental to what this thread is about. Perhaps. But if we're running into issues related to that, it might be good to set aside the long-term goal for a bit and come up with a cleaner answer for intra-session locking. That could allow the pruning problem to be solved more cleanly in turn, and it'd be an improvement even if not. > Do you have a view on what the way forward might be? I'm fresh out of ideas at the moment, other than having a hope that divide-and-conquer (ie, solving subproblems first) might pay off. > This thread has gotten a tad depressing, honestly. All of the opinions > about what we ought to do seem to be based on the firm conviction that > X or Y or Z will not work, rather than on the confidence that A or B > or C will work. Yet I'm inclined to believe this problem is solvable. Yeah. We are working in an extremely not-green field here, which means it's a lot easier to see pre-existing reasons why X will not work than to have confidence that it will work. But hey, if this were easy then we'd have done it already. regards, tom lane
On Mon, Aug 19, 2024 at 1:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > But that seems somewhat incidental to what this thread is about. > > Perhaps. But if we're running into issues related to that, it might > be good to set aside the long-term goal for a bit and come up with > a cleaner answer for intra-session locking. That could allow the > pruning problem to be solved more cleanly in turn, and it'd be > an improvement even if not. Maybe, but the pieces aren't quite coming together for me. Solving this would mean that if we execute a stale plan, we'd be more likely to get a good error and less likely to get a bad, nasty-looking internal error, or a crash. That's good on its own terms, but we don't really want user queries to produce errors at all, so I don't think we'd feel any more free to rearrange the order of operations than we do today. > > Do you have a view on what the way forward might be? > > I'm fresh out of ideas at the moment, other than having a hope that > divide-and-conquer (ie, solving subproblems first) might pay off. Fair enough, but why do you think that the original approach of creating a data structure from within the plan cache mechanism (probably via a call into some new executor entrypoint) and then feeding that through to ExecutorRun() time can't work? Is it possible you latched onto some non-optimal decisions that the early versions of the patch made, rather than there being a fundamental problem with the concept? I actually thought the do-it-at-executorstart-time approach sounded pretty good, even though we might have to abandon planstate tree initialization partway through, right up until Amit started talking about moving ExecutorStart() from PortalRun() to PortalStart(), which I have a feeling is going to create a bigger problem than we can solve. I think if we want to save that approach, we should try to figure out if we can teach the plancache to replan one query from a list without replanning the others, which seems like it might allow us to keep the order of major operations unchanged. Otherwise, it makes sense to me to have another go at the other approach, at least to make sure we understand clearly why it can't work. > Yeah. We are working in an extremely not-green field here, which > means it's a lot easier to see pre-existing reasons why X will not > work than to have confidence that it will work. But hey, if this > were easy then we'd have done it already. Yeah, true. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Aug 20, 2024 at 1:39 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Aug 16, 2024 at 8:36 AM Amit Langote <amitlangote09@gmail.com> wrote: > > So it is possible for the executor to try to run a plan that has > > become invalid since it was created, so... > > I'm not sure what the "so what" here is. I meant that if the executor has to deal with broken plans anyway, we might as well lean into that fact by choosing not to handle only the cached plan case in a certain way. Yes, I understand that that's not a good justification. > > One perhaps crazy idea [1]: > > > > What if we remove AcquireExecutorLocks() and move the responsibility > > of taking the remaining necessary locks into the executor (those on > > any inheritance children that are added during planning and thus not > > accounted for by AcquirePlannerLocks()), like the patch already does, > > but don't make it also check if the plan has become invalid, which it > > can't do anyway unless it's from a CachedPlan. That means we instead > > let the executor throw any errors that occur when trying to either > > initialize the plan because of the changes that have occurred to the > > objects referenced in the plan, like what is happening in the above > > example. If that case is going to be rare anway, why spend energy on > > checking the validity and replan, especially if that's not an easy > > thing to do as we're finding out. In the above example, we could say > > that it's a user error to create a rule like that, so it should not > > happen in practice, but when it does, the executor seems to deal with > > it correctly by refusing to execute a broken plan . Perhaps it's more > > worthwhile to make the executor behave correctly in face of plan > > invalidation than teach the rest of the system to deal with the > > executor throwing its hands up when it runs into an invalid plan? > > Again, I think this may be a crazy line of thinking but just wanted to > > get it out there. > > I don't know whether this is crazy or not. I think there are two > issues. One, the set of checks that we have right now might not be > complete, and we might just not have realized that because it happens > infrequently enough that we haven't found all the bugs. If that's so, > then a change like this could be a good thing, because it might force > us to fix stuff we should be fixing anyway. I have a feeling that some > of the checks you hit there were added as bug fixes long after the > code was written originally, so my confidence that we don't have more > bugs isn't especially high. This makes sense. > And two, it matters a lot how frequent the errors will be in practice. > I think we normally try to replan rather than let a stale plan be used > because we want to not fail, because users don't like failure. If the > design you propose here would make failures more (or less) frequent, > then that's a problem (or awesome). I think we'd modify plancache.c to postpone the locking of only prunable relations (i.e., partitions), so we're looking at only a handful of concurrent modifications that are going to cause execution errors. That's because we disallow many DDL modifications of partitions unless they are done via recursion from the parent, so the space of errors in practice would be smaller compared to if we were to postpone *all* cached plan locks to ExecInitNode() time. DROP INDEX a_partion_only_index comes to mind as something that might cause an error. I've not tested if other partition-only constraints can cause unsafe behaviors. 
Perhaps we can add the check for CachedPlan.is_valid after every table_open() and index_open() in the executor that takes a lock or at all the places we discussed previously and throw the error (say: "cached plan is no longer valid") if it's false. That's better than running into and throwing some random error by soldiering ahead with its initialization / execution, but still a loss in terms of user experience because we're adding a new failure mode, however rare. -- Thanks, Amit Langote
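A minimal sketch of the partition-only index case mentioned above (names invented):

create table pt (a int, b text) partition by range (a);
create table pt1 partition of pt for values from (1) to (100);
create table pt2 partition of pt for values from (100) to (200);
create index pt1_b_idx on pt1 (b);    -- exists only on pt1, not on the parent

-- A cached generic plan that index-scans pt1 via pt1_b_idx could start failing
-- if a concurrent session runs: drop index pt1_b_idx;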
On Tue, Aug 20, 2024 at 3:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Aug 19, 2024 at 1:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > > But that seems somewhat incidental to what this thread is about. > > > > Perhaps. But if we're running into issues related to that, it might > > be good to set aside the long-term goal for a bit and come up with > > a cleaner answer for intra-session locking. That could allow the > > pruning problem to be solved more cleanly in turn, and it'd be > > an improvement even if not. > > Maybe, but the pieces aren't quite coming together for me. Solving > this would mean that if we execute a stale plan, we'd be more likely > to get a good error and less likely to get a bad, nasty-looking > internal error, or a crash. That's good on its own terms, but we don't > really want user queries to produce errors at all, so I don't think > we'd feel any more free to rearrange the order of operations than we > do today. Yeah, it's unclear whether executing a potentially stale plan is an acceptable tradeoff compared to replanning, especially if it occurs rarely. Personally, I would prefer that it is. > > > Do you have a view on what the way forward might be? > > > > I'm fresh out of ideas at the moment, other than having a hope that > > divide-and-conquer (ie, solving subproblems first) might pay off. > > Fair enough, but why do you think that the original approach of > creating a data structure from within the plan cache mechanism > (probably via a call into some new executor entrypoint) and then > feeding that through to ExecutorRun() time can't work? That would be ExecutorStart(). The data structure need not be referenced after ExecInitNode(). > Is it possible > you latched onto some non-optimal decisions that the early versions of > the patch made, rather than there being a fundamental problem with the > concept? > > I actually thought the do-it-at-executorstart-time approach sounded > pretty good, even though we might have to abandon planstate tree > initialization partway through, right up until Amit started talking > about moving ExecutorStart() from PortalRun() to PortalStart(), which > I have a feeling is going to create a bigger problem than we can > solve. I think if we want to save that approach, we should try to > figure out if we can teach the plancache to replan one query from a > list without replanning the others, which seems like it might allow us > to keep the order of major operations unchanged. Otherwise, it makes > sense to me to have another go at the other approach, at least to make > sure we understand clearly why it can't work. +1 -- Thanks, Amit Langote
On Tue, Aug 20, 2024 at 9:00 AM Amit Langote <amitlangote09@gmail.com> wrote: > I think we'd modify plancache.c to postpone the locking of only > prunable relations (i.e., partitions), so we're looking at only a > handful of concurrent modifications that are going to cause execution > errors. That's because we disallow many DDL modifications of > partitions unless they are done via recursion from the parent, so the > space of errors in practice would be smaller compared to if we were to > postpone *all* cached plan locks to ExecInitNode() time. DROP INDEX > a_partion_only_index comes to mind as something that might cause an > error. I've not tested if other partition-only constraints can cause > unsafe behaviors. This seems like a valid point to some extent, but in other contexts we've had discussions about how we don't actually guarantee all that much uniformity between a partitioned table and its partitions, and it's been questioned whether we made the right decisions there. So I'm not entirely sure that the surface area for problems here will be as narrow as you're hoping -- I think we'd need to go through all of the ALTER TABLE variants and think it through. But maybe the problems aren't that bad. It does seem like constraints can change the plan. Imagine the partition had a CHECK(false) constraint before and now doesn't, or something. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Aug 20, 2024 at 11:53 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Aug 20, 2024 at 9:00 AM Amit Langote <amitlangote09@gmail.com> wrote: > > I think we'd modify plancache.c to postpone the locking of only > > prunable relations (i.e., partitions), so we're looking at only a > > handful of concurrent modifications that are going to cause execution > > errors. That's because we disallow many DDL modifications of > > partitions unless they are done via recursion from the parent, so the > > space of errors in practice would be smaller compared to if we were to > > postpone *all* cached plan locks to ExecInitNode() time. DROP INDEX > > a_partion_only_index comes to mind as something that might cause an > > error. I've not tested if other partition-only constraints can cause > > unsafe behaviors. > > This seems like a valid point to some extent, but in other contexts > we've had discussions about how we don't actually guarantee all that > much uniformity between a partitioned table and its partitions, and > it's been questioned whether we made the right decisions there. So I'm > not entirely sure that the surface area for problems here will be as > narrow as you're hoping -- I think we'd need to go through all of the > ALTER TABLE variants and think it through. But maybe the problems > aren't that bad. Many changeable properties that are reflected in the RelationData of a partition after getting the lock on it seem to cause no issues as long as the executor code only looks at RelationData, which is true for most Scan nodes. It also seems true for ModifyTable which looks into RelationData for relation properties relevant to inserts/deletes. The two things that don't cope are: * Index Scan nodes with concurrent DROP INDEX of partition-only indexes. * Concurrent DROP CONSTRAINT of partition-only CHECK and NOT NULL constraints can lead to incorrect results, as I write below. > It does seem like constraints can change the plan. Imagine the > partition had a CHECK(false) constraint before and now doesn't, or > something. Yeah, if the CHECK constraint gets dropped concurrently, any new rows that got added after that will not be returned by executing a stale cached plan, because the plan would have been created based on the assumption that such rows shouldn't be there due to the CHECK constraint. We currently don't explicitly check that the constraints that were used during planning still exist before executing the plan. Overall, I'm starting to feel less enthused by the idea of throwing an error in the executor due to known and unknown hazards of trying to execute a stale plan. Even if we made a note in the docs of such hazards, any users who run into these rare errors are likely to head to -bugs or -hackers anyway. Tom said we should perhaps look at the hazards caused by intra-session locking, but we'd still be left with the hazards of missing indexes and constraints, AFAICS, due to DROP from other sessions. So, the options: * The replanning aspect of the lock-in-the-executor design would be simpler if a CachedPlan contained the plan for a single query rather than a list of queries, as previously mentioned. This is particularly due to the requirements of the PORTAL_MULTI_QUERY case. However, this option might be impractical. * Polish the patch for the old design of doing the initial pruning before AcquireExecutorLocks() and focus on hashing out any bugs and issues of that design. -- Thanks, Amit Langote
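A sketch of the constraint hazard described above, assuming constraint exclusion applies to the partition's ordinary CHECK constraint (names invented):

create table pt (a int, b int) partition by range (a);
create table pt1 partition of pt for values from (1) to (100);
create table pt2 partition of pt for values from (100) to (200);
alter table pt2 add constraint pt2_b_check check (b > 100);

-- A plan for "select * from pt where b = 42" built while pt2_b_check exists may
-- exclude pt2 entirely; if another session then drops the constraint and inserts
-- (150, 42), executing the stale cached plan silently misses that row.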
On Wed, Aug 21, 2024 at 8:45 AM Amit Langote <amitlangote09@gmail.com> wrote: > * The replanning aspect of the lock-in-the-executor design would be > simpler if a CachedPlan contained the plan for a single query rather > than a list of queries, as previously mentioned. This is particularly > due to the requirements of the PORTAL_MULTI_QUERY case. However, this > option might be impractical. It might be, but maybe it would be worth a try? I mean, GetCachedPlan() seems to just call pg_plan_queries() which just loops over the list of query trees and does the same thing for each one. If we wanted to replan a single query, why couldn't we do fake_querytree_list = list_make1(list_nth(querytree_list, n)) and then call pg_plan_queries(fake_querytree_list)? Or something equivalent to that. We could have a new GetCachedSinglePlan(cplan, n) to do this. > * Polish the patch for the old design of doing the initial pruning > before AcquireExecutorLocks() and focus on hashing out any bugs and > issues of that design. That's also an option. It probably has issues too, but I don't know what they are exactly. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Aug 21, 2024 at 10:10 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Aug 21, 2024 at 8:45 AM Amit Langote <amitlangote09@gmail.com> wrote: > > * The replanning aspect of the lock-in-the-executor design would be > > simpler if a CachedPlan contained the plan for a single query rather > > than a list of queries, as previously mentioned. This is particularly > > due to the requirements of the PORTAL_MULTI_QUERY case. However, this > > option might be impractical. > > It might be, but maybe it would be worth a try? I mean, > GetCachedPlan() seems to just call pg_plan_queries() which just loops > over the list of query trees and does the same thing for each one. If > we wanted to replan a single query, why couldn't we do > fake_querytree_list = list_make1(list_nth(querytree_list, n)) and then > call pg_plan_queries(fake_querytree_list)? Or something equivalent to > that. We could have a new GetCachedSinglePlan(cplan, n) to do this. I've been hacking to prototype this, and it's showing promise. It helps make the replan loop at the call sites that start the executor with an invalidatable plan more localized and less prone to action-at-a-distance issues. However, the interface and contract of the new function in my prototype are pretty specialized for the replan loop in this context—meaning it's not as general-purpose as GetCachedPlan(). Essentially, what you get when you call it is a 'throwaway' CachedPlan containing only the plan for the query that failed during ExecutorStart(), not a plan integrated into the original CachedPlanSource's stmt_list. A call site entering the replan loop will retry the execution with that throwaway plan, release it once done, and resume looping over the plans in the original list. The invalid plan that remains in the original list will be discarded and replanned in the next call to GetCachedPlan() using the same CachedPlanSource. While that may sound undesirable, I'm inclined to think it's not something that needs optimization, given that we're expecting this code path to be taken rarely. I'll post a version of a revamped locks-in-the-executor patch set using the above function after debugging some more. -- Thanks, Amit Langote
Hi, On Thu, Aug 29, 2024 at 9:34 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Fri, Aug 23, 2024 at 9:48 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Wed, Aug 21, 2024 at 10:10 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > On Wed, Aug 21, 2024 at 8:45 AM Amit Langote <amitlangote09@gmail.com> wrote: > > > > * The replanning aspect of the lock-in-the-executor design would be > > > > simpler if a CachedPlan contained the plan for a single query rather > > > > than a list of queries, as previously mentioned. This is particularly > > > > due to the requirements of the PORTAL_MULTI_QUERY case. However, this > > > > option might be impractical. > > > > > > It might be, but maybe it would be worth a try? I mean, > > > GetCachedPlan() seems to just call pg_plan_queries() which just loops > > > over the list of query trees and does the same thing for each one. If > > > we wanted to replan a single query, why couldn't we do > > > fake_querytree_list = list_make1(list_nth(querytree_list, n)) and then > > > call pg_plan_queries(fake_querytree_list)? Or something equivalent to > > > that. We could have a new GetCachedSinglePlan(cplan, n) to do this. > > > > I've been hacking to prototype this, and it's showing promise. It > > helps make the replan loop at the call sites that start the executor > > with an invalidatable plan more localized and less prone to > > action-at-a-distance issues. However, the interface and contract of > > the new function in my prototype are pretty specialized for the replan > > loop in this context—meaning it's not as general-purpose as > > GetCachedPlan(). Essentially, what you get when you call it is a > > 'throwaway' CachedPlan containing only the plan for the query that > > failed during ExecutorStart(), not a plan integrated into the original > > CachedPlanSource's stmt_list. A call site entering the replan loop > > will retry the execution with that throwaway plan, release it once > > done, and resume looping over the plans in the original list. The > > invalid plan that remains in the original list will be discarded and > > replanned in the next call to GetCachedPlan() using the same > > CachedPlanSource. While that may sound undesirable, I'm inclined to > > think it's not something that needs optimization, given that we're > > expecting this code path to be taken rarely. > > > > I'll post a version of a revamped locks-in-the-executor patch set > > using the above function after debugging some more. > > Here it is. > > 0001 implements changes to defer the locking of runtime-prunable > relations to the executor. The new design introduces a bitmapset > field in PlannedStmt to distinguish at runtime between relations that > are prunable whose locking can be deferred until ExecInitNode() and > those that are not and must be locked in advance. The set of prunable > relations can be constructed by looking at all the PartitionPruneInfos > in the plan and checking which are subject to "initial" pruning steps. > The set of unprunable relations is obtained by subtracting those from > the set of all RT indexes. This design gets rid of one annoying > aspect of the old design which was the need to add specialized fields > to store the RT indexes of partitioned relations that are not > otherwise referenced in the plan tree. That was necessary because in > the old design, I had removed the function AcquireExecutorLocks() > altogether to defer the locking of all child relations to execution. 
> In the new design such relations are still locked by > AcquireExecutorLocks(). > > 0002 is the old patch to make ExecEndNode() robust against partially > initialized PlanState nodes by adding NULL checks. > > 0003 is the patch to add changes to deal with the CachedPlan becoming > invalid before the deferred locks on prunable relations are taken. > I've moved the replan loop into a new wrapper-over-ExecutorStart() > function instead of having the same logic at multiple sites. The > replan logic uses the GetSingleCachedPlan() described in the quoted > text. The callers of the new ExecutorStart()-wrapper, which I've > dubbed ExecutorStartExt(), need to pass the CachedPlanSource and a > query_index, which is the index of the query being executed in the > list CachedPlanSource.query_list. They are needed by > GetSingleCachedPlan(). The changes outside the executor are pretty > minimal in this design and all the difficulties of having to loop back > to GetCachedPlan() are now gone. I like how this turned out. > > One idea that I think might be worth trying to reduce the footprint of > 0003 is to try to lock the prunable relations in a step of InitPlan() > separate from ExecInitNode(), which can be implemented by doing the > initial runtime pruning in that separate step. That way, we'll have > all the necessary locks before calling ExecInitNode() and so we don't > need to sprinkle the CachedPlanStillValid() checks all over the place > and worry about missed checks and dealing with partially initialized > PlanState trees. > > -- > Thanks, Amit Langote @@ -1241,7 +1244,7 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (customplan) { /* Build a custom plan */ - plan = BuildCachedPlan(plansource, qlist, boundParams, queryEnv); + plan = BuildCachedPlan(plansource, qlist, boundParams, queryEnv, true); Is the *true* here a typo? Seems it should be *false* for custom plan? -- Regards Junwang Zhao
On Sat, Aug 31, 2024 at 9:30 PM Junwang Zhao <zhjwpku@gmail.com> wrote: > @@ -1241,7 +1244,7 @@ GetCachedPlan(CachedPlanSource *plansource, > ParamListInfo boundParams, > if (customplan) > { > /* Build a custom plan */ > - plan = BuildCachedPlan(plansource, qlist, boundParams, queryEnv); > + plan = BuildCachedPlan(plansource, qlist, boundParams, queryEnv, true); > > Is the *true* here a typo? Seems it should be *false* for custom plan? That's correct, thanks for catching that. Will fix. -- Thanks, Amit Langote
Hi Amit, This is not a full review (sorry!) but here are a few comments. In general, I don't have a problem with this direction. I thought Tom's previous proposal of abandoning ExecInitNode() in medias res if we discover that we need to replan was doable and I still think that, but ISTM that this approach needs to touch less code, because abandoning ExecInitNode() partly through means we could have leftover state to clean up in any node in the PlanState tree, and as we've discussed, ExecEndNode() isn't necessarily prepared to clean up a PlanState tree that was only partially processed by ExecInitNode(). As far as I can see in the time I've spent looking at this today, 0001 looks pretty unobjectionable (with some exceptions that I've noted below). I also think 0003 looks pretty safe. It seems like partition pruning moves backward across a pretty modest amount of code that does pretty well-defined things. Basically, initialization-time pruning now happens before other types of node initialization, and before setting up row marks. I do however find the changes in 0002 to be less obviously correct and less obviously safe; see below for some notes about that. In 0001, the name root_parent_relids doesn't seem very clear to me, and neither does the explanation of what it does. You say "'root_parent_relids' identifies the relation to which both the parent plan and the PartitionPruneInfo given by 'part_prune_index' belong." But it's a set, so what does it mean to identify "the" relation? It's a set of relations, not just one. And why does the name include the word "root"? It's neither the PlannerGlobal object, which we often call root, nor is it the root of the partitioning hierarchy. To me, it looks like it's just the set of relids that we can potentially prune. I don't see why this isn't just called "relids", like the field from which it's copied: + pruneinfo->root_parent_relids = parentrel->relids; It just doesn't seem very root-y or very parent-y. - node->part_prune_info = partpruneinfo; + Extra blank line. In 0002, the handling of ExprContexts seems a little bit hard to understand. Sometimes we're using the PlanState's ExprContext, and sometimes we're using a separate context owned by the PartitionedRelPruningData's context, and it's not exactly clear why that is or what the consequences are. Likewise I wouldn't mind some more comments or explanation in the commit message of the changes in this patch related to EState objects. I can't help wondering if the changes here could have either semantic implications (like expression evaluation can produce different results than before) or performance implications (because we create objects that we didn't previously create). As noted above, this is really my only design-level concern about 0001-0003. Typo: partrtitioned Regrettably, I have not looked seriously at 0004 and 0005, so I can't comment on those. -- Robert Haas EDB: http://www.enterprisedb.com
Robert, On Fri, Oct 11, 2024 at 5:15 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Hi Amit, > > This is not a full review (sorry!) but here are a few comments. Thank you for taking a look. > In general, I don't have a problem with this direction. I thought > Tom's previous proposal of abandoning ExecInitNode() in medias res if > we discover that we need to replan was doable and I still think that, > but ISTM that this approach needs to touch less code, because > abandoning ExecInitNode() partly through means we could have leftover > state to clean up in any node in the PlanState tree, and as we've > discussed, ExecEndNode() isn't necessarily prepared to clean up a > PlanState tree that was only partially processed by ExecInitNode(). I will say that I feel more comfortable committing and be responsible for the refactoring I'm proposing in 0001-0003 than the changes required to take locks during ExecInitNode(), as seen in the patches up to version v52.. > As > far as I can see in the time I've spent looking at this today, 0001 > looks pretty unobjectionable (with some exceptions that I've noted > below). I also think 0003 looks pretty safe. It seems like partition > pruning moves backward across a pretty modest amount of code that does > pretty well-defined things. Basically, initialization-time pruning now > happens before other types of node initialization, and before setting > up row marks. I do however find the changes in 0002 to be less > obviously correct and less obviously safe; see below for some notes > about that. > > In 0001, the name root_parent_relids doesn't seem very clear to me, > and neither does the explanation of what it does. You say > "'root_parent_relids' identifies the relation to which both the parent > plan and the PartitionPruneInfo given by 'part_prune_index' belong." > But it's a set, so what does it mean to identify "the" relation? It's > a set of relations, not just one. The intention is to ensure that the bitmapset in PartitionPruneInfo corresponds to the apprelids bitmapset in the Append or MergeAppend node that owns the PartitionPruneInfo. Essentially, root_parent_relids is used to cross-check that both sets align, ensuring that the pruning logic applies to the same relations as the parent plan. > And why does the name include the > word "root"? It's neither the PlannerGlobal object, which we often > call root, nor is it the root of the partitioning hierarchy. To me, it > looks like it's just the set of relids that we can potentially prune. > I don't see why this isn't just called "relids", like the field from > which it's copied: > > + pruneinfo->root_parent_relids = parentrel->relids; > > It just doesn't seem very root-y or very parent-y. Maybe just "relids" suffices with a comment updated like this: * relids RelOptInfo.relids of the parent plan node (e.g. Append * or MergeAppend) to which his PartitionPruneInfo node * belongs. Used to ensure that the pruning logic matches * the parent plan's apprelids. > - node->part_prune_info = partpruneinfo; > + > > Extra blank line. Fixed. > In 0002, the handling of ExprContexts seems a little bit hard to > understand. Sometimes we're using the PlanState's ExprContext, and > sometimes we're using a separate context owned by the > PartitionedRelPruningData's context, and it's not exactly clear why > that is or what the consequences are. Likewise I wouldn't mind some > more comments or explanation in the commit message of the changes in > this patch related to EState objects. 
> I can't help wondering if the > changes here could have either semantic implications (like expression > evaluation can produce different results than before) or performance > implications (because we create objects that we didn't previously > create). I have taken another look at whether there's any real need to use separate ExprContexts for initial and runtime pruning and ISTM there isn't, so we can make "exec" pruning use the same ExprContext as what "init" would have used. There *is* a difference, however, in how we initialize the partition key expressions for initial and runtime pruning, but it's not problematic to use the same ExprContext. I'll update the commentary a bit more. > Typo: partrtitioned Fixed. > Regrettably, I have not looked seriously at 0004 and 0005, so I can't > comment on those. Ok, I'm updating 0005 to change how the CachedPlan is handled when it becomes invalid during InitPlan(). Currently (v56), a separate transient CachedPlan is created for the query being initialized when invalidation occurs. However, it seems better to update the original CachedPlan in place to avoid extra bookkeeping for transient plans—an approach Robert suggested in an off-list discussion. Will post a new version next week. -- Thanks, Amit Langote
On Fri, Oct 11, 2024 at 3:30 AM Amit Langote <amitlangote09@gmail.com> wrote: > Maybe just "relids" suffices with a comment updated like this: > > * relids RelOptInfo.relids of the parent plan node (e.g. Append > * or MergeAppend) to which his PartitionPruneInfo node > * belongs. Used to ensure that the pruning logic matches > * the parent plan's apprelids. LGTM. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 11, 2024 at 3:30 AM Amit Langote <amitlangote09@gmail.com> wrote: >> Maybe just "relids" suffices with a comment updated like this: >> >> * relids RelOptInfo.relids of the parent plan node (e.g. Append >> * or MergeAppend) to which his PartitionPruneInfo node >> * belongs. Used to ensure that the pruning logic matches >> * the parent plan's apprelids. > LGTM. "his" -> "this", surely? regards, tom lane
Hi Tomas, On Mon, Dec 2, 2024 at 3:36 AM Tomas Vondra <tomas@vondra.me> wrote: > Hi, > > I took a look at this patch, mostly to familiarize myself with the > pruning etc. I have a bunch of comments, but all of that is minor, > perhaps even nitpicking - with prior feedback from David, Tom and > Robert, I can't really compete with that. Thanks for looking at this. These are helpful. > FWIW the patch needs a rebase, there's a minor bitrot - but it was > simply enough to fix for a review / testing. > > > 0001 > ---- > > 1) But if we don't expect this error to actually happen, do we really > need to make it ereport()? Maybe it should be plain elog(). I mean, it's > "can't happen" and thus doesn't need translations etc. > > if (!bms_equal(relids, pruneinfo->relids)) > ereport(ERROR, > errcode(ERRCODE_INTERNAL_ERROR), > errmsg_internal("mismatching PartitionPruneInfo found at > part_prune_index %d", > part_prune_index), > errdetail_internal("plan node relids %s, pruneinfo > relids %s", > bmsToString(relids), > bmsToString(pruneinfo->relids))); I'm fine with elog() here even if it causes the message to be longer: elog(ERROR, "mismatching PartitionPruneInfo found at part_prune_index %d (plan node relids %s, pruneinfo relids %s) > Perhaps it should even be an assert? I am not sure about that. Having a message handy might be good if a user ends up hitting this case for whatever reason, like trying to run a corrupted plan. > 2) unnecessary newline added to execPartition.h Perhaps you meant "removed". Fixed. > 3) this comment in EState doesn't seem very helpful > > List *es_part_prune_infos; /* PlannedStmt.partPruneInfos */ Agreed, fixed to be like the comment for es_rteperminfos: List *es_part_prune_infos; /* List of PartitionPruneInfo */ > 5) PlannerGlobal > > /* List of PartitionPruneInfo contained in the plan */ > List *partPruneInfos; > > Why does this say "contained in the plan" unlike the other fields? Is > there some sort of difference? I'm not saying it's wrong. Ok, maybe the following is a bit more helpful and like the comment for other fields: /* "flat" list of PartitionPruneInfos */ List *partPruneInfos; > 0002 > ---- > > 1) Isn't it weird/undesirable partkey_datum_from_expr() loses some of > the asserts? Would the assert be incorrect in the new implementation, or > are we removing it simply because we happen to not have one of the fields? The former -- the asserts would be incorrect in the new implementation -- because in the new implementation a standalone ExprContext is used that is independent of the parent PlanState (when available) for both types of runtime pruning. The old asserts, particularly the second one, weren't asserting something very useful anyway, IMO. What I mean is that the ExprContext provided in the PartitionPruneContext to be the same as the parent PlanState's ps_ExprContext isn't critical to the code that follows. Nor whether the PlanState is available or not. > 2) inconsistent spelling: run-time vs. runtime I assume you meant in this comment: * estate The EState for the query doing runtime pruning Fixed by using run-time, which is a more commonly used term in the source code than runtime. > 3) PartitionPruneContext.is_valid - I think I'd rename the flag to > "initialized" or something like that. The "is_valid" is a bit confusing, > because it might seem the context can get invalidated later, but AFAICS > that's not the case - we just initialize it lazily. Agree that "initialized" is better, so renamed. 
> 0003 > ---- > > 1) In InitPlan I'd move > > estate->es_part_prune_infos = plannedstmt->partPruneInfos; > > before the comment, which is more about ExecDoInitialPruning. Makes sense, done. > 2) I'm not quite sure what "exec" partition pruning is? > > /* > * ExecInitPartitionPruning > * Initialize the data structures needed for runtime "exec" partition > * pruning and return the result of initial pruning, if available. > > Is that the same thing as "runtime pruning"? "Exec" pruning refers to pruning performed during execution, using PARAM_EXEC parameters. In contrast, "init" pruning occurs during plan initialization, using parameters whose values remain constant during execution, such as PARAM_EXTERN parameters and stable functions. Before this patch, the ExecInitPartitionPruning function, called during ExecutorStart(), performed "init" pruning and set up state in the PartitionPruneState for subsequent "exec" pruning during ExecutorRun(). With this patch, "init" pruning is performed well before this function is called, leaving its sole responsibility to setting up the state for "exec" pruning. It may be worth renaming the function to better reflect this new role, rather than updating only the comment. Actually, that is what I decided to do in the attached, along with some other adjustments like moving ExecDoInitialPruning() to execPartition.c from execMain.c, fixing up some obsolete comments, etc. > 0004 > ---- > > 1) typo: paraller/parallel Oops, fixed. > 2) What about adding an assert to ExecFindMatchingSubPlans, to check > valisubplan_rtis is not NULL? It's just mentioned in a comment, but > better to explicitly enforce that? Good idea, done. > > 2) It may not be quite clear why ExecInitUpdateProjection() switches to > mt_updateColnosLists. Should that be explained in a comment, somewhere? There is a comment in the ModifyTableState struct definition: /* * List of valid updateColnosLists. Contains only those belonging to * unpruned relations from ModifyTable.updateColnosLists. */ List *mt_updateColnosLists; It seems redundant to reiterate this in ExecInitUpdateProjection(). > 3) unnecessary newline in ExecLookupResultRelByOid Removed. > 0005 > ---- > > 1) auto_explain.c - So what happens if the plan gets invalidated? The > hook explain_ExecutorStart returns early, but then what? Does that break > the user session somehow, or what? It will get called again after ExecutorStartExt() loops back to do ExecutorStart() with a new updated plan tree. > 2) Isn't it a bit fragile if this requires every extension to update > and add the ExecPlanStillValid() calls to various places? The ExecPlanStillValid() call only needs to be added immediately after the call to standard_ExecutorStart() in an extension's ExecutorStart_hook() implementation. > What if an > extension doesn't do that? What weirdness will happen? The QueryDesc.planstate won't contain a PlanState tree for starters and other state information that InitPlan() populates in EState based on the PlannedStmt. > Maybe it'd be > possible to at least check this in some other executor hook? Or at least > we could ensure the check was done in assert-enabled builds? Or > something to make extension authors aware of this? I've added a note in the commit message, but if that's not enough, one idea might be to change the return type of ExecutorStart_hook so that the extensions that implement it are forced to be adjusted. Say, from void to bool to indicate whether standard_ExecutorStart() succeeded and thus created a "valid" plan. 
I had that in the previous versions of the patch. Thoughts? > Aside from going through the patches, I did a simple benchmark to see > how this works in practice. I did a simple test, with pgbench -S and > variable number of partitions/clients. I also varied the number of locks > per transaction, because I was wondering if it may interact with the > fast-path improvements. See the attached xeon.sh script and CSV with > results from the 44/88-core machine. > > There's also two PDFs visualizing the results, to show the impact as a > difference between "master" (no patches) vs. "pruning" build with v57 > applied. As usual, "green" is good (faster), read is "bad" (slower). > > For most combinations of parameters, there's no impact on throughput. > Anything in 99-101% is just regular noise, possibly even more. I'm > trying to reduce the noise a bit more, but this seems acceptable. I'd > like to discuss three "cases" I see in the results: Thanks for doing these benchmarks. I'll reply separately to discuss the individual cases. > costing / auto mode > ------------------- > > Anyway, this leads me to a related question - not quite a "bug" in the > patch, but something to perhaps think about. And that's costing, and > what "auto" should do. > > There are two PNG charts, showing throughput for runs with -M prepared > and 1000 partitions. Each chart shows throughput for the three cache > modes, and different client counts. There's a clear distinction between > "master" and "patched" runs - the "generic" plans performed terribly, by > orders of magnitude. With the patches it beats the "custom" plans. > > Which is great! But it also means that while "auto" used to do the right > thing, with the patches that's not the case. > > AFAIK that's because we don't consider the runtime pruning when costing > the plans, so the cost is calculated as if no pruning happened. And so > it seems way more expensive than it should ... and it loses with the > custom scans. Is that correct, or do I understand this wrong? That's correct. The planner does not consider runtime pruning when assigning costs to Append or MergeAppend paths in create_{merge}append_path(). > Just to be clear, I'm not claiming the patch has to deal with this. I > suppose it can be handled as a future improvement, and I'm not even sure > there's a good way to consider this during costing. For example, can we > estimate how many partitions will be pruned? There have been discussions about this in the 2017 development thread of run-time pruning [1] and likely at some later point in other threads. One simple approach mentioned at [1] is to consider that only 1 partition will be scanned for queries containing WHERE partkey = $1, because only 1 partition can contain matching rows with that condition. I agree that this should be dealt with sooner than later so users get generic plans even without having to use force_generic_plan. I'll post the updated patches tomorrow. -- Thanks, Amit Langote [1] https://www.postgresql.org/message-id/CA%2BTgmoZv8sd9cKyYtHwmd_13%2BBAjkVKo%3DECe7G98tBK5Ejwatw%40mail.gmail.com
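A small sketch of the "init" vs. "exec" pruning distinction described above, assuming EXPLAIN output of roughly the usual shape (names invented):

create table pt (a int) partition by list (a);
create table pt1 partition of pt for values in (1);
create table pt2 partition of pt for values in (2);

set plan_cache_mode = force_generic_plan;
prepare q (int) as select * from pt where a = $1;
explain (analyze, costs off, timing off, summary off) execute q (1);
-- "init" pruning: $1 is known at executor startup, so the generic plan is
-- expected to report something like "Subplans Removed: 1"

explain (analyze, costs off, timing off, summary off)
    select * from pt where a = (select 1);
-- "exec" pruning: the comparison value comes from a PARAM_EXEC parameter
-- computed at run time, so the pruned partition's subplan is expected to show
-- up as "(never executed)" instead.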
On 12/4/24 14:34, Amit Langote wrote: > Hi Tomas, > > On Mon, Dec 2, 2024 at 3:36 AM Tomas Vondra <tomas@vondra.me> wrote: >> Hi, >> >> I took a look at this patch, mostly to familiarize myself with the >> pruning etc. I have a bunch of comments, but all of that is minor, >> perhaps even nitpicking - with prior feedback from David, Tom and >> Robert, I can't really compete with that. > > Thanks for looking at this. These are helpful. > >> FWIW the patch needs a rebase, there's a minor bitrot - but it was >> simply enough to fix for a review / testing. >> >> >> 0001 >> ---- >> >> 1) But if we don't expect this error to actually happen, do we really >> need to make it ereport()? Maybe it should be plain elog(). I mean, it's >> "can't happen" and thus doesn't need translations etc. >> >> if (!bms_equal(relids, pruneinfo->relids)) >> ereport(ERROR, >> errcode(ERRCODE_INTERNAL_ERROR), >> errmsg_internal("mismatching PartitionPruneInfo found at >> part_prune_index %d", >> part_prune_index), >> errdetail_internal("plan node relids %s, pruneinfo >> relids %s", >> bmsToString(relids), >> bmsToString(pruneinfo->relids))); > > I'm fine with elog() here even if it causes the message to be longer: > > elog(ERROR, "mismatching PartitionPruneInfo found at part_prune_index > %d (plan node relids %s, pruneinfo relids %s) > I'm not forcing you to do elog, if you think ereport() is better. I'm only asking because AFAIK the "policy" is that ereport is for cases that think can happen (and thus get translated), while elog(ERROR) is for cases that we believe shouldn't happen. So every time I see "ereport" I ask myself "how could this happen" which doesn't seem to be the case here. >> Perhaps it should even be an assert? > > I am not sure about that. Having a message handy might be good if a > user ends up hitting this case for whatever reason, like trying to run > a corrupted plan. > I'm a bit skeptical about this, TBH. If we assume the plan is "corrupted", why should we notice in this particular place? I mean, it could be corrupted in a million different ways, and the chance that it got through all the earlier steps is like 1 in a 1.000.000. >> 2) unnecessary newline added to execPartition.h > > Perhaps you meant "removed". Fixed. > Yes, sorry. I misread the diff. >> 5) PlannerGlobal >> >> /* List of PartitionPruneInfo contained in the plan */ >> List *partPruneInfos; >> >> Why does this say "contained in the plan" unlike the other fields? Is >> there some sort of difference? I'm not saying it's wrong. > > Ok, maybe the following is a bit more helpful and like the comment for > other fields: > > /* "flat" list of PartitionPruneInfos */ > List *partPruneInfos; > WFM >> 0002 >> ---- >> >> 1) Isn't it weird/undesirable partkey_datum_from_expr() loses some of >> the asserts? Would the assert be incorrect in the new implementation, or >> are we removing it simply because we happen to not have one of the fields? > > The former -- the asserts would be incorrect in the new implementation > -- because in the new implementation a standalone ExprContext is used > that is independent of the parent PlanState (when available) for both > types of runtime pruning. > > The old asserts, particularly the second one, weren't asserting > something very useful anyway, IMO. What I mean is that the > ExprContext provided in the PartitionPruneContext to be the same as > the parent PlanState's ps_ExprContext isn't critical to the code that > follows. Nor whether the PlanState is available or not. 
> OK, thanks for explaining >> 2) inconsistent spelling: run-time vs. runtime > > I assume you meant in this comment: > > * estate The EState for the query doing runtime pruning > > Fixed by using run-time, which is a more commonly used term in the > source code than runtime. > Not quite. I was looking at runtime/run-time in the patch files, but now I realize some of that is preexisting ... Still, maybe the patch should stick to one spelling. >> 2) I'm not quite sure what "exec" partition pruning is? >> >> /* >> * ExecInitPartitionPruning >> * Initialize the data structures needed for runtime "exec" partition >> * pruning and return the result of initial pruning, if available. >> >> Is that the same thing as "runtime pruning"? > > "Exec" pruning refers to pruning performed during execution, using > PARAM_EXEC parameters. In contrast, "init" pruning occurs during plan > initialization, using parameters whose values remain constant during > execution, such as PARAM_EXTERN parameters and stable functions. > > Before this patch, the ExecInitPartitionPruning function, called > during ExecutorStart(), performed "init" pruning and set up state in > the PartitionPruneState for subsequent "exec" pruning during > ExecutorRun(). With this patch, "init" pruning is performed well > before this function is called, leaving its sole responsibility to > setting up the state for "exec" pruning. It may be worth renaming the > function to better reflect this new role, rather than updating only > the comment. > > Actually, that is what I decided to do in the attached, along with > some other adjustments like moving ExecDoInitialPruning() to > execPartition.c from execMain.c, fixing up some obsolete comments, > etc. > I don't see any attachment :-( Anyway, if I understand correctly, the "runtime pruning" has two separate cases - initial pruning and exec pruning. Is that right? > >> >> 2) It may not be quite clear why ExecInitUpdateProjection() switches to >> mt_updateColnosLists. Should that be explained in a comment, somewhere? > > There is a comment in the ModifyTableState struct definition: > > /* > * List of valid updateColnosLists. Contains only those belonging to > * unpruned relations from ModifyTable.updateColnosLists. > */ > List *mt_updateColnosLists; > > It seems redundant to reiterate this in ExecInitUpdateProjection(). > Ah, I see. Makes sense. > >> 0005 >> ---- >> >> 1) auto_explain.c - So what happens if the plan gets invalidated? The >> hook explain_ExecutorStart returns early, but then what? Does that break >> the user session somehow, or what? > > It will get called again after ExecutorStartExt() loops back to do > ExecutorStart() with a new updated plan tree. > >> 2) Isn't it a bit fragile if this requires every extension to update >> and add the ExecPlanStillValid() calls to various places? > > The ExecPlanStillValid() call only needs to be added immediately after > the call to standard_ExecutorStart() in an extension's > ExecutorStart_hook() implementation. > >> What if an >> extension doesn't do that? What weirdness will happen? > > The QueryDesc.planstate won't contain a PlanState tree for starters > and other state information that InitPlan() populates in EState based > on the PlannedStmt. > OK, and the consequence is that the query will fail, right? >> Maybe it'd be >> possible to at least check this in some other executor hook? Or at least >> we could ensure the check was done in assert-enabled builds? Or >> something to make extension authors aware of this? 
> > I've added a note in the commit message, but if that's not enough, one > idea might be to change the return type of ExecutorStart_hook so that > the extensions that implement it are forced to be adjusted. Say, from > void to bool to indicate whether standard_ExecutorStart() succeeded > and thus created a "valid" plan. I had that in the previous versions > of the patch. Thoughts? > Maybe. My concern is that this case (plan getting invalidated) is fairly rare, so it's entirely plausible the extension will seem to work just fine without the code update for a long time. Sure, changing the APIs is allowed, I'm just wondering if maybe there might be a way to not have this issue, or at least notice the missing call early. I haven't tried, wouldn't it be better to modify ExecutorStart() to do the retries internally? I mean, the extensions wouldn't need to check if the plan is still valid, ExecutorStart() would take care of that. Yeah, it might need some new arguments, but that's more obvious. >> Aside from going through the patches, I did a simple benchmark to see >> how this works in practice. I did a simple test, with pgbench -S and >> variable number of partitions/clients. I also varied the number of locks >> per transaction, because I was wondering if it may interact with the >> fast-path improvements. See the attached xeon.sh script and CSV with >> results from the 44/88-core machine. >> >> There's also two PDFs visualizing the results, to show the impact as a >> difference between "master" (no patches) vs. "pruning" build with v57 >> applied. As usual, "green" is good (faster), read is "bad" (slower). >> >> For most combinations of parameters, there's no impact on throughput. >> Anything in 99-101% is just regular noise, possibly even more. I'm >> trying to reduce the noise a bit more, but this seems acceptable. I'd >> like to discuss three "cases" I see in the results: > > Thanks for doing these benchmarks. I'll reply separately to discuss > the individual cases. > >> costing / auto mode >> ------------------- >> >> Anyway, this leads me to a related question - not quite a "bug" in the >> patch, but something to perhaps think about. And that's costing, and >> what "auto" should do. >> >> There are two PNG charts, showing throughput for runs with -M prepared >> and 1000 partitions. Each chart shows throughput for the three cache >> modes, and different client counts. There's a clear distinction between >> "master" and "patched" runs - the "generic" plans performed terribly, by >> orders of magnitude. With the patches it beats the "custom" plans. >> >> Which is great! But it also means that while "auto" used to do the right >> thing, with the patches that's not the case. >> >> AFAIK that's because we don't consider the runtime pruning when costing >> the plans, so the cost is calculated as if no pruning happened. And so >> it seems way more expensive than it should ... and it loses with the >> custom scans. Is that correct, or do I understand this wrong? > > That's correct. The planner does not consider runtime pruning when > assigning costs to Append or MergeAppend paths in > create_{merge}append_path(). > >> Just to be clear, I'm not claiming the patch has to deal with this. I >> suppose it can be handled as a future improvement, and I'm not even sure >> there's a good way to consider this during costing. For example, can we >> estimate how many partitions will be pruned? 
> > There have been discussions about this in the 2017 development thread > of run-time pruning [1] and likely at some later point in other > threads. One simple approach mentioned at [1] is to consider that > only 1 partition will be scanned for queries containing WHERE partkey > = $1, because only 1 partition can contain matching rows with that > condition. > > I agree that this should be dealt with sooner than later so users get > generic plans even without having to use force_generic_plan. > > I'll post the updated patches tomorrow. > Cool, thanks! regards -- Tomas Vondra
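As an aside, for readers trying to keep the two kinds of run-time pruning apart in the exchange above: the split is driven by two separate lists of pruning steps that the planner emits. The sketch below only illustrates that shape; the struct name is made up for illustration, though fields with these roles exist in PartitionedRelPruneInfo (plannodes.h).

    #include "postgres.h"

    #include "nodes/bitmapset.h"
    #include "nodes/pg_list.h"

    /*
     * Illustration only (not the real node): the two flavors of run-time
     * pruning are driven by separate step lists.
     */
    typedef struct PruningStepsSketch
    {
        /*
         * "init" pruning: steps whose comparison values are fixed for the
         * whole execution -- constants, PARAM_EXTERN parameters, stable
         * functions.  Under this patch series they are evaluated once,
         * before any plan nodes are initialized, which is what allows
         * locking of the pruned partitions to be skipped.
         */
        List       *initial_pruning_steps;

        /*
         * "exec" pruning: steps that depend on PARAM_EXEC values, e.g.
         * values supplied by the outer side of a nested loop, so they can
         * only be evaluated, possibly repeatedly, during ExecutorRun().
         */
        List       *exec_pruning_steps;

        /* PARAM_EXEC parameter IDs the exec steps depend on */
        Bitmapset  *execparamids;
    } PruningStepsSketch;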
Tomas Vondra <tomas@vondra.me> writes: > I'm not forcing you to do elog, if you think ereport() is better. I'm > only asking because AFAIK the "policy" is that ereport is for cases that > think can happen (and thus get translated), while elog(ERROR) is for > cases that we believe shouldn't happen. The proposed coding looks fine from that perspective, because it uses errmsg_internal and errdetail_internal which don't give rise to translatable strings. Having said that, if we think this is a "can't happen" case then it's fair to wonder why go to such lengths to format it prettily. Also, I'd argue that the error message style guidelines still apply, but this errdetail doesn't conform. regards, tom lane
On Thu, Dec 5, 2024 at 2:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Tomas Vondra <tomas@vondra.me> writes: > > I'm not forcing you to do elog, if you think ereport() is better. I'm > > only asking because AFAIK the "policy" is that ereport is for cases that > > think can happen (and thus get translated), while elog(ERROR) is for > > cases that we believe shouldn't happen. > > The proposed coding looks fine from that perspective, because it uses > errmsg_internal and errdetail_internal which don't give rise to > translatable strings. Having said that, if we think this is a > "can't happen" case then it's fair to wonder why go to such lengths > to format it prettily. Also, I'd argue that the error message > style guidelines still apply, but this errdetail doesn't conform. Thinking about this further, perhaps an Assert is sufficient here. An Append/MergeAppend node's part_prune_index not pointing to the correct entry in the global "flat" list of PartitionPruneInfos would indicate a bug. It seems unlikely that user actions could cause this issue. -- Thanks, Amit Langote
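For comparison, here is a sketch of the three styles being weighed, built from the hunk quoted earlier in the thread; the wrapper function is hypothetical and exists only to make the fragment self-contained, and the message text is the wording already proposed upthread, not final.

    #include "postgres.h"

    #include "nodes/bitmapset.h"
    #include "nodes/plannodes.h"

    /* Hypothetical wrapper around the cross-check discussed above. */
    static void
    check_pruneinfo_relids(Bitmapset *relids, PartitionPruneInfo *pruneinfo,
                           int part_prune_index)
    {
        /* Style 1: ereport() with the _internal() variants, as in the quoted hunk. */
        if (!bms_equal(relids, pruneinfo->relids))
            ereport(ERROR,
                    errcode(ERRCODE_INTERNAL_ERROR),
                    errmsg_internal("mismatching PartitionPruneInfo found at part_prune_index %d",
                                    part_prune_index),
                    errdetail_internal("plan node relids %s, pruneinfo relids %s",
                                       bmsToString(relids),
                                       bmsToString(pruneinfo->relids)));

        /* Style 2: a plain elog(), completing the one-liner suggested upthread. */
        if (!bms_equal(relids, pruneinfo->relids))
            elog(ERROR, "mismatching PartitionPruneInfo found at part_prune_index %d (plan node relids %s, pruneinfo relids %s)",
                 part_prune_index, bmsToString(relids), bmsToString(pruneinfo->relids));

        /* Style 3: treat a mismatch purely as a planner-bug guard. */
        Assert(bms_equal(relids, pruneinfo->relids));
    }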
On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote: > On 12/4/24 14:34, Amit Langote wrote: > > On Mon, Dec 2, 2024 at 3:36 AM Tomas Vondra <tomas@vondra.me> wrote: > >> 0001 > >> ---- > >> > >> 1) But if we don't expect this error to actually happen, do we really > >> need to make it ereport()? Maybe it should be plain elog(). I mean, it's > >> "can't happen" and thus doesn't need translations etc. > >> > >> if (!bms_equal(relids, pruneinfo->relids)) > >> ereport(ERROR, > >> errcode(ERRCODE_INTERNAL_ERROR), > >> errmsg_internal("mismatching PartitionPruneInfo found at > >> part_prune_index %d", > >> part_prune_index), > >> errdetail_internal("plan node relids %s, pruneinfo > >> relids %s", > >> bmsToString(relids), > >> bmsToString(pruneinfo->relids))); > > > > I'm fine with elog() here even if it causes the message to be longer: > > > > elog(ERROR, "mismatching PartitionPruneInfo found at part_prune_index > > %d (plan node relids %s, pruneinfo relids %s) > > > > I'm not forcing you to do elog, if you think ereport() is better. I'm > only asking because AFAIK the "policy" is that ereport is for cases that > think can happen (and thus get translated), while elog(ERROR) is for > cases that we believe shouldn't happen. > > So every time I see "ereport" I ask myself "how could this happen" which > doesn't seem to be the case here. > > >> Perhaps it should even be an assert? > > > > I am not sure about that. Having a message handy might be good if a > > user ends up hitting this case for whatever reason, like trying to run > > a corrupted plan. > > I'm a bit skeptical about this, TBH. If we assume the plan is > "corrupted", why should we notice in this particular place? I mean, it > could be corrupted in a million different ways, and the chance that it > got through all the earlier steps is like 1 in a 1.000.000. Yeah, I am starting to think the same. Btw, the idea to have a check and elog() / ereport() came from Alvaro upthread: https://www.postgresql.org/message-id/20221130181201.mfinyvtob3j5i2a6%40alvherre.pgsql > >> 2) I'm not quite sure what "exec" partition pruning is? > >> > >> /* > >> * ExecInitPartitionPruning > >> * Initialize the data structures needed for runtime "exec" partition > >> * pruning and return the result of initial pruning, if available. > >> > >> Is that the same thing as "runtime pruning"? > > > > "Exec" pruning refers to pruning performed during execution, using > > PARAM_EXEC parameters. In contrast, "init" pruning occurs during plan > > initialization, using parameters whose values remain constant during > > execution, such as PARAM_EXTERN parameters and stable functions. > > > > Before this patch, the ExecInitPartitionPruning function, called > > during ExecutorStart(), performed "init" pruning and set up state in > > the PartitionPruneState for subsequent "exec" pruning during > > ExecutorRun(). With this patch, "init" pruning is performed well > > before this function is called, leaving its sole responsibility to > > setting up the state for "exec" pruning. It may be worth renaming the > > function to better reflect this new role, rather than updating only > > the comment. > > > > Actually, that is what I decided to do in the attached, along with > > some other adjustments like moving ExecDoInitialPruning() to > > execPartition.c from execMain.c, fixing up some obsolete comments, > > etc. 
> > > > I don't see any attachment :-( > > Anyway, if I understand correctly, the "runtime pruning" has two > separate cases - initial pruning and exec pruning. Is that right? That's correct. These patches are about performing "initial" pruning at a different time and place so that we can take the deferred locks on the unpruned partitions before we perform ExecInitNode() on any of the plan trees in the PlannedStmt. > >> 0005 > >> ---- > >> > >> 1) auto_explain.c - So what happens if the plan gets invalidated? The > >> hook explain_ExecutorStart returns early, but then what? Does that break > >> the user session somehow, or what? > > > > It will get called again after ExecutorStartExt() loops back to do > > ExecutorStart() with a new updated plan tree. > > > >> 2) Isn't it a bit fragile if this requires every extension to update > >> and add the ExecPlanStillValid() calls to various places? > > > > The ExecPlanStillValid() call only needs to be added immediately after > > the call to standard_ExecutorStart() in an extension's > > ExecutorStart_hook() implementation. > > > >> What if an > >> extension doesn't do that? What weirdness will happen? > > > > The QueryDesc.planstate won't contain a PlanState tree for starters > > and other state information that InitPlan() populates in EState based > > on the PlannedStmt. > > OK, and the consequence is that the query will fail, right? No, the core executor will retry the execution with a new updated plan. In the absence of the early return, the extension might even crash when accessing such incomplete QueryDesc. What the patch makes the ExecutorStart_hook do is similar to how InitPlan() will return early when locks taken on partitions that survive initial pruning invalidate the plan. > >> Maybe it'd be > >> possible to at least check this in some other executor hook? Or at least > >> we could ensure the check was done in assert-enabled builds? Or > >> something to make extension authors aware of this? > > > > I've added a note in the commit message, but if that's not enough, one > > idea might be to change the return type of ExecutorStart_hook so that > > the extensions that implement it are forced to be adjusted. Say, from > > void to bool to indicate whether standard_ExecutorStart() succeeded > > and thus created a "valid" plan. I had that in the previous versions > > of the patch. Thoughts? > > Maybe. My concern is that this case (plan getting invalidated) is fairly > rare, so it's entirely plausible the extension will seem to work just > fine without the code update for a long time. You might see the errors like the one below when the core executor or a hook tries to initialize or process in some other way a known invalid plan, for example, because an unpruned partition's index got concurrently dropped before the executor got the lock: ERROR: could not open relation with OID xxx > Sure, changing the APIs is allowed, I'm just wondering if maybe there > might be a way to not have this issue, or at least notice the missing > call early. > > I haven't tried, wouldn't it be better to modify ExecutorStart() to do > the retries internally? I mean, the extensions wouldn't need to check if > the plan is still valid, ExecutorStart() would take care of that. Yeah, > it might need some new arguments, but that's more obvious. One approach could be to move some code from standard_ExecutorStart() into ExecutorStart(). 
Specifically, the code responsible for setting up enough state in the EState to perform ExecDoInitialPruning(), which takes the locks that might invalidate the plan. If the plan does become invalid, the hook and standard_ExecutorStart() are not called. Instead, the caller, ExecutorStartExt() in this case, creates a new plan. This avoids the need to add ExecPlanStillValid() checks anywhere, whether in core or extension code. However, it does mean accessing the PlannedStmt earlier than InitPlan() currently does, but the current placement of that code is not exactly set in stone. -- Thanks, Amit Langote
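A very rough sketch of the restructuring floated above, with the open parts left as comments: nothing here beyond the existing ExecutorStart_hook dispatch is actual code, and in particular how much EState setup ExecDoInitialPruning() would need before it can run is exactly the part that is not settled.

    #include "postgres.h"

    #include "executor/executor.h"

    void
    ExecutorStart(QueryDesc *queryDesc, int eflags)
    {
        /*
         * (1) Create the EState and copy in just enough from
         *     queryDesc->plannedstmt and queryDesc->params for the "initial"
         *     pruning expressions to be evaluated.
         * (2) Run ExecDoInitialPruning(), which locks surviving partitions
         *     and may thereby invalidate the plan.
         * (3) If the plan is now invalid, return without calling the hook or
         *     standard_ExecutorStart(); ExecutorStartExt() builds a new plan.
         * (4) Otherwise proceed as today:
         */
        if (ExecutorStart_hook)
            (*ExecutorStart_hook) (queryDesc, eflags);
        else
            standard_ExecutorStart(queryDesc, eflags);
    }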
On Thu, Dec 5, 2024 at 3:53 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote: > > Sure, changing the APIs is allowed, I'm just wondering if maybe there > > might be a way to not have this issue, or at least notice the missing > > call early. > > > > I haven't tried, wouldn't it be better to modify ExecutorStart() to do > > the retries internally? I mean, the extensions wouldn't need to check if > > the plan is still valid, ExecutorStart() would take care of that. Yeah, > > it might need some new arguments, but that's more obvious. > > One approach could be to move some code from standard_ExecutorStart() > into ExecutorStart(). Specifically, the code responsible for setting > up enough state in the EState to perform ExecDoInitialPruning(), which > takes locks that might invalidate the plan. If the plan does become > invalid, the hook and standard_ExecutorStart() are not called. > Instead, the caller, ExecutorStartExt() in this case, creates a new > plan. > > This avoids the need to add ExecPlanStillValid() checks anywhere, > whether in core or extension code. However, it does mean accessing the > PlannedStmt earlier than InitPlan(), but the current placement of the > code is not exactly set in stone. I tried this approach and found that it essentially disables testing of this patch using the delay_execution module, which relies on the ExecutorStart_hook(). The way the testing works is that the hook in delay_execution.c pauses the execution of a cached plan to allow a concurrent session to drop an index referenced in the plan. When unpaused, execution initialization resumes by calling standard_ExecutorStart(). At this point, obtaining the lock on the partition whose index has been dropped invalidates the plan, which the hook detects and reports. It then also reports the successful re-execution of an updated plan that no longer references the dropped index. Hmm. -- Thanks, Amit Langote
On 12/5/24 07:53, Amit Langote wrote: > On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote: >> ... >> >>>> What if an >>>> extension doesn't do that? What weirdness will happen? >>> >>> The QueryDesc.planstate won't contain a PlanState tree for starters >>> and other state information that InitPlan() populates in EState based >>> on the PlannedStmt. >> >> OK, and the consequence is that the query will fail, right? > > No, the core executor will retry the execution with a new updated > plan. In the absence of the early return, the extension might even > crash when accessing such incomplete QueryDesc. > > What the patch makes the ExecutorStart_hook do is similar to how > InitPlan() will return early when locks taken on partitions that > survive initial pruning invalidate the plan. > Isn't that what I said? My question was what happens if the extension does not add the new ExecPlanStillValid() call - sorry if that wasn't clear. If it can crash, that's what I meant by "fail". >>>> Maybe it'd be >>>> possible to at least check this in some other executor hook? Or at least >>>> we could ensure the check was done in assert-enabled builds? Or >>>> something to make extension authors aware of this? >>> >>> I've added a note in the commit message, but if that's not enough, one >>> idea might be to change the return type of ExecutorStart_hook so that >>> the extensions that implement it are forced to be adjusted. Say, from >>> void to bool to indicate whether standard_ExecutorStart() succeeded >>> and thus created a "valid" plan. I had that in the previous versions >>> of the patch. Thoughts? >> >> Maybe. My concern is that this case (plan getting invalidated) is fairly >> rare, so it's entirely plausible the extension will seem to work just >> fine without the code update for a long time. > > You might see the errors like the one below when the core executor or > a hook tries to initialize or process in some other way a known > invalid plan, for example, because an unpruned partition's index got > concurrently dropped before the executor got the lock: > > ERROR: could not open relation with OID xxx > Yeah, but how likely is that? How often get plans invalidated in regular application workload. People don't create or drop indexes very often, for example ... Again, I'm not saying requiring the call would be unacceptable, I'm sure we made similar changes in the past. But if it wasn't needed without too much contortion, that would be nice. regards -- Tomas Vondra
On 12/5/24 12:28, Amit Langote wrote: > On Thu, Dec 5, 2024 at 3:53 PM Amit Langote <amitlangote09@gmail.com> wrote: >> On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote: >>> Sure, changing the APIs is allowed, I'm just wondering if maybe there >>> might be a way to not have this issue, or at least notice the missing >>> call early. >>> >>> I haven't tried, wouldn't it be better to modify ExecutorStart() to do >>> the retries internally? I mean, the extensions wouldn't need to check if >>> the plan is still valid, ExecutorStart() would take care of that. Yeah, >>> it might need some new arguments, but that's more obvious. >> >> One approach could be to move some code from standard_ExecutorStart() >> into ExecutorStart(). Specifically, the code responsible for setting >> up enough state in the EState to perform ExecDoInitialPruning(), which >> takes locks that might invalidate the plan. If the plan does become >> invalid, the hook and standard_ExecutorStart() are not called. >> Instead, the caller, ExecutorStartExt() in this case, creates a new >> plan. >> >> This avoids the need to add ExecPlanStillValid() checks anywhere, >> whether in core or extension code. However, it does mean accessing the >> PlannedStmt earlier than InitPlan(), but the current placement of the >> code is not exactly set in stone. > > I tried this approach and found that it essentially disables testing > of this patch using the delay_execution module, which relies on the > ExecutorStart_hook(). The way the testing works is that the hook in > delay_execution.c pauses the execution of a cached plan to allow a > concurrent session to drop an index referenced in the plan. When > unpaused, execution initialization resumes by calling > standard_ExecutorStart(). At this point, obtaining the lock on the > partition whose index has been dropped invalidates the plan, which the > hook detects and reports. It then also reports the successful > re-execution of an updated plan that no longer references the dropped > index. Hmm. > It's not clear to me why the change disables this testing, and I can't try without a patch. Could you explain? thanks -- Tomas Vondra
On Thu, Dec 5, 2024 at 10:53 PM Tomas Vondra <tomas@vondra.me> wrote: > On 12/5/24 07:53, Amit Langote wrote: > > On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote: > >> ... > >> > >>>> What if an > >>>> extension doesn't do that? What weirdness will happen? > >>> > >>> The QueryDesc.planstate won't contain a PlanState tree for starters > >>> and other state information that InitPlan() populates in EState based > >>> on the PlannedStmt. > >> > >> OK, and the consequence is that the query will fail, right? > > > > No, the core executor will retry the execution with a new updated > > plan. In the absence of the early return, the extension might even > > crash when accessing such incomplete QueryDesc. > > > > What the patch makes the ExecutorStart_hook do is similar to how > > InitPlan() will return early when locks taken on partitions that > > survive initial pruning invalidate the plan. > > Isn't that what I said? My question was what happens if the extension > does not add the new ExecPlanStillValid() call - sorry if that wasn't > clear. If it can crash, that's what I meant by "fail". Ok, I see. So, I suppose you meant to confirm if the invalid plan won't silently be executed returning wrong results. Yes, I don't think that would happen given the kinds of invalidations that are possible. The various checks in the ExecInitNode() path, such as the one that catches a missing index, will prevent the plan from running. I may not have searched exhaustively enough though. > >>>> Maybe it'd be > >>>> possible to at least check this in some other executor hook? Or at least > >>>> we could ensure the check was done in assert-enabled builds? Or > >>>> something to make extension authors aware of this? > >>> > >>> I've added a note in the commit message, but if that's not enough, one > >>> idea might be to change the return type of ExecutorStart_hook so that > >>> the extensions that implement it are forced to be adjusted. Say, from > >>> void to bool to indicate whether standard_ExecutorStart() succeeded > >>> and thus created a "valid" plan. I had that in the previous versions > >>> of the patch. Thoughts? > >> > >> Maybe. My concern is that this case (plan getting invalidated) is fairly > >> rare, so it's entirely plausible the extension will seem to work just > >> fine without the code update for a long time. > > > > You might see the errors like the one below when the core executor or > > a hook tries to initialize or process in some other way a known > > invalid plan, for example, because an unpruned partition's index got > > concurrently dropped before the executor got the lock: > > > > ERROR: could not open relation with OID xxx > > Yeah, but how likely is that? How often get plans invalidated in regular > application workload. People don't create or drop indexes very often, > for example ... Yeah, that's a valid point. Andres once mentioned that ANALYZE can invalidate plans and that can occur frequently in busy systems. > Again, I'm not saying requiring the call would be unacceptable, I'm sure > we made similar changes in the past. But if it wasn't needed without too > much contortion, that would be nice. I tend to agree. Another change introduced by the patch that extensions might need to mind (noted in the commit message of v58-0004) is the addition of the es_unpruned_relids field to EState. This field tracks the RT indexes of relations that are locked and therefore safe to access during execution. 
Importantly, it does not include the RT indexes of leaf partitions that are pruned during "initial" pruning and thus remain unlocked. This change means that executor extensions can no longer assume that all relations in the range table are locked and safe to access. Instead, extensions must account for the possibility that some relations, specifically pruned partitions, are not locked. Normally, executor code accesses relations using ExecGetRangeTableRelation(), which does not take a lock before returning the Relation pointer, assuming that locks are already managed upstream. -- Thanks, Amit Langote
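To illustrate the kind of adjustment described above, an extension that walks the range table during execution could guard its accesses roughly as below. bms_is_member() and the two-argument ExecGetRangeTableRelation() are long-standing APIs; es_unpruned_relids is the field added by this patch series, and the helper name and its NULL-return convention are purely illustrative.

    #include "postgres.h"

    #include "executor/executor.h"
    #include "nodes/bitmapset.h"
    #include "utils/rel.h"

    /*
     * Illustrative helper: return the opened Relation for a range-table
     * index, or NULL if the RTE belongs to a leaf partition that was pruned
     * during "initial" pruning and therefore was never locked in this
     * execution.
     */
    static Relation
    get_unpruned_relation(EState *estate, Index rti)
    {
        /* Pruned partitions are not members of es_unpruned_relids. */
        if (!bms_is_member(rti, estate->es_unpruned_relids))
            return NULL;

        /*
         * Safe to access: the relation was locked upstream, either by the
         * plan cache machinery or while performing initial pruning.
         */
        return ExecGetRangeTableRelation(estate, rti);
    }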