Thread: generic plans and "initial" pruning
Executing generic plans involving partitions is known to become slower as partition count grows due to a number of bottlenecks, with AcquireExecutorLocks() showing at the top in profiles.

Previous attempt at solving that problem was by David Rowley [1], where he proposed delaying locking of *all* partitions appearing under an Append/MergeAppend until "initial" pruning is done during the executor initialization phase. A problem with that approach that he has described in [2] is that leaving partitions unlocked can lead to race conditions where the Plan node belonging to a partition can be invalidated when a concurrent session successfully alters the partition between AcquireExecutorLocks() saying the plan is okay to execute and then actually executing it.

However, using an idea that Robert suggested to me off-list a little while back, it seems possible to determine the set of partitions that we can safely skip locking. The idea is to look at the "initial" or "pre-execution" pruning instructions contained in a given Append or MergeAppend node when AcquireExecutorLocks() is collecting the relations to lock and consider relations from only those sub-nodes that survive performing those instructions. I've attempted implementing that idea in the attached patch.

Note that "initial" pruning steps are now performed twice when executing generic plans: once in AcquireExecutorLocks() to find partitions to be locked, and a 2nd time in ExecInit[Merge]Append() to determine the set of partition sub-nodes to be initialized for execution, though I wasn't able to come up with a good idea to avoid this duplication.

Using the following benchmark setup:

pgbench testdb -i --partitions=$nparts > /dev/null 2>&1
pgbench -n testdb -S -T 30 -Mprepared

And plan_cache_mode = force_generic_plan, I get following numbers:

HEAD:

32    tps = 20561.776403 (without initial connection time)
64    tps = 12553.131423 (without initial connection time)
128   tps = 13330.365696 (without initial connection time)
256   tps = 8605.723120 (without initial connection time)
512   tps = 4435.951139 (without initial connection time)
1024  tps = 2346.902973 (without initial connection time)
2048  tps = 1334.680971 (without initial connection time)

Patched:

32    tps = 27554.156077 (without initial connection time)
64    tps = 27531.161310 (without initial connection time)
128   tps = 27138.305677 (without initial connection time)
256   tps = 25825.467724 (without initial connection time)
512   tps = 19864.386305 (without initial connection time)
1024  tps = 18742.668944 (without initial connection time)
2048  tps = 16312.412704 (without initial connection time)

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CAKJS1f_kfRQ3ZpjQyHC7=PK9vrhxiHBQFZ+hc0JCwwnRKkF3hg@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKJS1f99JNe%2Bsw5E3qWmS%2BHeLMFaAhehKO67J1Ym3pXv0XBsxw%40mail.gmail.com
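(For anyone reproducing this: with -M prepared, the -S select-only workload boils down to repeatedly executing, as a prepared statement, a query of roughly the following shape, where :aid is a pgbench variable chosen at random per transaction and pgbench -i --partitions range-partitions pgbench_accounts on aid. The exact script text may differ across pgbench versions.)

    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

With plan_cache_mode = force_generic_plan, every execution of that statement goes through CheckCachedPlan() / AcquireExecutorLocks() over the full set of partitions, which is the bottleneck being measured above.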
Attachment
On Sat, Dec 25, 2021 at 9:06 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Executing generic plans involving partitions is known to become slower as partition count grows due to a number of bottlenecks, with AcquireExecutorLocks() showing at the top in profiles.
>
> Previous attempt at solving that problem was by David Rowley [1], where he proposed delaying locking of *all* partitions appearing under an Append/MergeAppend until "initial" pruning is done during the executor initialization phase. A problem with that approach that he has described in [2] is that leaving partitions unlocked can lead to race conditions where the Plan node belonging to a partition can be invalidated when a concurrent session successfully alters the partition between AcquireExecutorLocks() saying the plan is okay to execute and then actually executing it.
>
> However, using an idea that Robert suggested to me off-list a little while back, it seems possible to determine the set of partitions that we can safely skip locking. The idea is to look at the "initial" or "pre-execution" pruning instructions contained in a given Append or MergeAppend node when AcquireExecutorLocks() is collecting the relations to lock and consider relations from only those sub-nodes that survive performing those instructions. I've attempted implementing that idea in the attached patch.

In which cases, we will have "pre-execution" pruning instructions that can be used to skip locking partitions? Can you please give a few examples where this approach will be useful?

The benchmark is showing good results, indeed.

--
Best Wishes,
Ashutosh Bapat
On Tue, Dec 28, 2021 at 22:12 Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> On Sat, Dec 25, 2021 at 9:06 AM Amit Langote <amitlangote09@gmail.com> wrote:
> >
> > Executing generic plans involving partitions is known to become slower
> > as partition count grows due to a number of bottlenecks, with
> > AcquireExecutorLocks() showing at the top in profiles.
> >
> > Previous attempt at solving that problem was by David Rowley [1],
> > where he proposed delaying locking of *all* partitions appearing under
> > an Append/MergeAppend until "initial" pruning is done during the
> > executor initialization phase. A problem with that approach that he
> > has described in [2] is that leaving partitions unlocked can lead to
> > race conditions where the Plan node belonging to a partition can be
> > invalidated when a concurrent session successfully alters the
> > partition between AcquireExecutorLocks() saying the plan is okay to
> > execute and then actually executing it.
> >
> > However, using an idea that Robert suggested to me off-list a little
> > while back, it seems possible to determine the set of partitions that
> > we can safely skip locking. The idea is to look at the "initial" or
> > "pre-execution" pruning instructions contained in a given Append or
> > MergeAppend node when AcquireExecutorLocks() is collecting the
> > relations to lock and consider relations from only those sub-nodes
> > that survive performing those instructions. I've attempted
> > implementing that idea in the attached patch.
> >
> In which cases, we will have "pre-execution" pruning instructions that
> can be used to skip locking partitions? Can you please give a few
> examples where this approach will be useful?
This is mainly to be useful for prepared queries, so something like:
prepare q as select * from partitioned_table where key = $1;
And that too when execute q(…) uses a generic plan. Generic plans are problematic because they must contain nodes for all partitions (without any plan-time pruning), which means CheckCachedPlan() has to spend time proportional to the number of partitions to determine that the plan is still usable / has not been invalidated; most of that time is spent in AcquireExecutorLocks().
Other bottlenecks, not addressed in this patch, pertain to some executor startup/shutdown subroutines that process the range table of a PlannedStmt in its entirety, whose length is also proportional to the number of partitions when the plan is generic.
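To make that concrete, here is a minimal, self-contained example; the table, partition, and statement names are only illustrative, and the partition count is kept tiny:

    create table partitioned_table (key int, val text) partition by hash (key);
    create table partitioned_table_p0 partition of partitioned_table for values with (modulus 4, remainder 0);
    create table partitioned_table_p1 partition of partitioned_table for values with (modulus 4, remainder 1);
    create table partitioned_table_p2 partition of partitioned_table for values with (modulus 4, remainder 2);
    create table partitioned_table_p3 partition of partitioned_table for values with (modulus 4, remainder 3);

    set plan_cache_mode = force_generic_plan;
    prepare q as select * from partitioned_table where key = $1;

    -- The generic plan contains an Append over all 4 partitions; "initial"
    -- pruning at executor startup discards all but one of them (EXPLAIN
    -- should report something like "Subplans Removed: 3"), yet
    -- AcquireExecutorLocks() currently still locks every partition.
    explain (costs off) execute q(1);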
> The benchmark is showing good results, indeed.
Thanks.
Amit Langote
EDB: http://www.enterprisedb.com
On Fri, Dec 31, 2021 at 7:56 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Tue, Dec 28, 2021 at 22:12 Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>>
>> On Sat, Dec 25, 2021 at 9:06 AM Amit Langote <amitlangote09@gmail.com> wrote:
>> >
>> > Executing generic plans involving partitions is known to become slower as partition count grows due to a number of bottlenecks, with AcquireExecutorLocks() showing at the top in profiles.
>> >
>> > Previous attempt at solving that problem was by David Rowley [1], where he proposed delaying locking of *all* partitions appearing under an Append/MergeAppend until "initial" pruning is done during the executor initialization phase. A problem with that approach that he has described in [2] is that leaving partitions unlocked can lead to race conditions where the Plan node belonging to a partition can be invalidated when a concurrent session successfully alters the partition between AcquireExecutorLocks() saying the plan is okay to execute and then actually executing it.
>> >
>> > However, using an idea that Robert suggested to me off-list a little while back, it seems possible to determine the set of partitions that we can safely skip locking. The idea is to look at the "initial" or "pre-execution" pruning instructions contained in a given Append or MergeAppend node when AcquireExecutorLocks() is collecting the relations to lock and consider relations from only those sub-nodes that survive performing those instructions. I've attempted implementing that idea in the attached patch.
>> >
>>
>> In which cases, we will have "pre-execution" pruning instructions that can be used to skip locking partitions? Can you please give a few examples where this approach will be useful?
>
> This is mainly to be useful for prepared queries, so something like:
>
> prepare q as select * from partitioned_table where key = $1;
>
> And that too when execute q(…) uses a generic plan. Generic plans are problematic because they must contain nodes for all partitions (without any plan-time pruning), which means CheckCachedPlan() has to spend time proportional to the number of partitions to determine that the plan is still usable / has not been invalidated; most of that time is spent in AcquireExecutorLocks().
>
> Other bottlenecks, not addressed in this patch, pertain to some executor startup/shutdown subroutines that process the range table of a PlannedStmt in its entirety, whose length is also proportional to the number of partitions when the plan is generic.
>
>> The benchmark is showing good results, indeed.
>

Indeed.

Here are few comments for v1 patch:

+ /* Caller error if we get here without contains_init_steps */
+ Assert(pruneinfo->contains_init_steps);


- prunedata = prunestate->partprunedata[i];
- pprune = &prunedata->partrelprunedata[0];

- /* Perform pruning without using PARAM_EXEC Params */
- find_matching_subplans_recurse(prunedata, pprune, true, &result);
+ if (parentrelids)
+     *parentrelids = NULL;

You got two blank lines after Assert.

--

+ /* Set up EState if not in the executor proper. */
+ if (estate == NULL)
+ {
+     estate = CreateExecutorState();
+     estate->es_param_list_info = params;
+     free_estate = true;
  }

... [Skip]

+ if (free_estate)
+ {
+     FreeExecutorState(estate);
+     estate = NULL;
  }

I think this work should be left to the caller.

--

/*
 * Stuff that follows matches exactly what ExecCreatePartitionPruneState()
 * does, except we don't need a PartitionPruneState here, so don't call
 * that function.
 *
 * XXX some refactoring might be good.
 */

+1, while doing it would be nice if foreach_current_index() is used instead of the i & j sequence in the respective foreach() block, IMO.

--

+ while ((i = bms_next_member(validsubplans, i)) >= 0)
+ {
+     Plan *subplan = list_nth(subplans, i);
+
+     context->relations =
+         bms_add_members(context->relations,
+                         get_plan_scanrelids(subplan));
+ }

I think instead of get_plan_scanrelids() the GetLockableRelations_worker() can be used; if so, then no need to add get_plan_scanrelids() function.

--

  /* Nodes containing prunable subnodes. */
+ case T_MergeAppend:
+     {
+         PlannedStmt *plannedstmt = context->plannedstmt;
+         List *rtable = plannedstmt->rtable;
+         ParamListInfo params = context->params;
+         PartitionPruneInfo *pruneinfo;
+         Bitmapset *validsubplans;
+         Bitmapset *parentrelids;
...
          if (pruneinfo && pruneinfo->contains_init_steps)
          {
              int i;
...
              return false;
          }
      }
      break;

Most of the declarations need to be moved inside the if-block.

Also, initially, I was a bit concerned regarding this code block inside GetLockableRelations_worker(), what if (pruneinfo && pruneinfo->contains_init_steps) evaluated to false? After debugging I realized that plan_tree_walker() will do the needful -- a bit of comment would have helped.

--

+ case T_CustomScan:
+     foreach(lc, ((CustomScan *) plan)->custom_plans)
+     {
+         if (walker((Plan *) lfirst(lc), context))
+             return true;
+     }
+     break;

Why not plan_walk_members() call like other nodes?

Regards,
Amul
On Fri, Dec 24, 2021 at 10:36 PM Amit Langote <amitlangote09@gmail.com> wrote: > However, using an idea that Robert suggested to me off-list a little > while back, it seems possible to determine the set of partitions that > we can safely skip locking. The idea is to look at the "initial" or > "pre-execution" pruning instructions contained in a given Append or > MergeAppend node when AcquireExecutorLocks() is collecting the > relations to lock and consider relations from only those sub-nodes > that survive performing those instructions. I've attempted > implementing that idea in the attached patch. Hmm. The first question that occurs to me is whether this is fully safe. Currently, AcquireExecutorLocks calls LockRelationOid for every relation involved in the query. That means we will probably lock at least one relation on which we previously had no lock and thus AcceptInvalidationMessages(). That will end up marking the query as no longer valid and CheckCachedPlan() will realize this and tell the caller to replan. In the corner case where we already hold all the required locks, we will not accept invalidation messages at this point, but must have done so after acquiring the last of the locks required, and if that didn't mark the plan invalid, it can't be invalid now either. Either way, everything is fine. With the proposed patch, we might never lock some of the relations involved in the query. Therefore, if one of those relations has been modified in some way that would invalidate the plan, we will potentially fail to discover this, and will use the plan anyway. For instance, suppose there's one particular partition that has an extra index and the plan involves an Index Scan using that index. Now suppose that the scan of the partition in question is pruned, but meanwhile, the index has been dropped. Now we're running a plan that scans a nonexistent index. Admittedly, we're not running that part of the plan. But is that enough for this to be safe? There are things (like EXPLAIN or auto_explain) that we might try to do even on a part of the plan tree that we don't try to run. Those things might break, because for example we won't be able to look up the name of an index in the catalogs for EXPLAIN output if the index is gone. This is just a relatively simple example and I think there are probably a bunch of others. There are a lot of kinds of DDL that could be performed on a partition that gets pruned away: DROP INDEX is just one example. The point is that to my knowledge we have no existing case where we try to use a plan that might be only partly valid, so if we introduce one, there's some risk there. I thought for a while, too, about whether changes to some object in a part of the plan that we're not executing could break things for the rest of the plan even if we never do anything with the plan but execute it. I can't quite see any actual hazard. For example, I thought about whether we might try to get the tuple descriptor for the pruned-away object and get a different tuple descriptor than we were expecting. I think we can't, because (1) the pruned object has to be a partition, and tuple descriptors have to match throughout the partitioning hierarchy, except for column ordering, which currently can't be changed after-the-fact and (2) IIRC, the tuple descriptor is stored in the plan and not reconstructed at runtime and (3) if we don't end up opening the relation because it's pruned, then we certainly can't do anything with its tuple descriptor. 
But it might be worth giving more thought to the question of whether there's any other way we could be depending on the details of an object that ended up getting pruned. > Note that "initial" pruning steps are now performed twice when > executing generic plans: once in AcquireExecutorLocks() to find > partitions to be locked, and a 2nd time in ExecInit[Merge]Append() to > determine the set of partition sub-nodes to be initialized for > execution, though I wasn't able to come up with a good idea to avoid > this duplication. I think this is something that will need to be fixed somehow. Apart from the CPU cost, it's scary to imagine that the set of nodes on which we acquired locks might be different from the set of nodes that we initialize. If we do the same computation twice, there must be some non-zero probability of getting a different answer the second time, even if the circumstances under which it would actually happen are remote. Consider, for example, a function that is labeled IMMUTABLE but is really VOLATILE. Now maybe you can get the system to lock one set of partitions and then initialize a different set of partitions. I don't think we want to try to reason about what consequences that might have and prove that somehow it's going to be OK; I think we want to nail the door shut very tightly to make sure that it can't. -- Robert Haas EDB: http://www.enterprisedb.com
Thanks for taking the time to look at this. On Wed, Jan 12, 2022 at 1:22 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Dec 24, 2021 at 10:36 PM Amit Langote <amitlangote09@gmail.com> wrote: > > However, using an idea that Robert suggested to me off-list a little > > while back, it seems possible to determine the set of partitions that > > we can safely skip locking. The idea is to look at the "initial" or > > "pre-execution" pruning instructions contained in a given Append or > > MergeAppend node when AcquireExecutorLocks() is collecting the > > relations to lock and consider relations from only those sub-nodes > > that survive performing those instructions. I've attempted > > implementing that idea in the attached patch. > > Hmm. The first question that occurs to me is whether this is fully safe. > > Currently, AcquireExecutorLocks calls LockRelationOid for every > relation involved in the query. That means we will probably lock at > least one relation on which we previously had no lock and thus > AcceptInvalidationMessages(). That will end up marking the query as no > longer valid and CheckCachedPlan() will realize this and tell the > caller to replan. In the corner case where we already hold all the > required locks, we will not accept invalidation messages at this > point, but must have done so after acquiring the last of the locks > required, and if that didn't mark the plan invalid, it can't be > invalid now either. Either way, everything is fine. > > With the proposed patch, we might never lock some of the relations > involved in the query. Therefore, if one of those relations has been > modified in some way that would invalidate the plan, we will > potentially fail to discover this, and will use the plan anyway. For > instance, suppose there's one particular partition that has an extra > index and the plan involves an Index Scan using that index. Now > suppose that the scan of the partition in question is pruned, but > meanwhile, the index has been dropped. Now we're running a plan that > scans a nonexistent index. Admittedly, we're not running that part of > the plan. But is that enough for this to be safe? There are things > (like EXPLAIN or auto_explain) that we might try to do even on a part > of the plan tree that we don't try to run. Those things might break, > because for example we won't be able to look up the name of an index > in the catalogs for EXPLAIN output if the index is gone. > > This is just a relatively simple example and I think there are > probably a bunch of others. There are a lot of kinds of DDL that could > be performed on a partition that gets pruned away: DROP INDEX is just > one example. The point is that to my knowledge we have no existing > case where we try to use a plan that might be only partly valid, so if > we introduce one, there's some risk there. I thought for a while, too, > about whether changes to some object in a part of the plan that we're > not executing could break things for the rest of the plan even if we > never do anything with the plan but execute it. I can't quite see any > actual hazard. For example, I thought about whether we might try to > get the tuple descriptor for the pruned-away object and get a > different tuple descriptor than we were expecting. 
I think we can't, > because (1) the pruned object has to be a partition, and tuple > descriptors have to match throughout the partitioning hierarchy, > except for column ordering, which currently can't be changed > after-the-fact and (2) IIRC, the tuple descriptor is stored in the > plan and not reconstructed at runtime and (3) if we don't end up > opening the relation because it's pruned, then we certainly can't do > anything with its tuple descriptor. But it might be worth giving more > thought to the question of whether there's any other way we could be > depending on the details of an object that ended up getting pruned. I have pondered on the possible hazards before writing the patch, mainly because the concerns about a previously discussed proposal were along similar lines [1]. IIUC, you're saying the plan tree is subject to inspection by non-core code before ExecutorStart() has initialized a PlanState tree, which must have discarded pruned portions of the plan tree. I wouldn't claim to have scanned *all* of the core code that could possibly access the invalidated portions of the plan tree, but from what I have seen, I couldn't find any site that does. An ExecutorStart_hook() gets to do that, but from what I can see it is expected to call standard_ExecutorStart() before doing its thing and supposedly only looks at the PlanState tree, which must be valid. Actually, EXPLAIN also does ExecutorStart() before starting to look at the plan (the PlanState tree), so must not run into pruned plan tree nodes. All that said, it does sound like wishful thinking to say that no problems can possibly occur. At first, I had tried to implement this such that the Append/MergeAppend nodes are edited to record the result of initial pruning, but it felt wrong to be munging the plan tree in plancache.c. Or, maybe this won't be a concern if performing ExecutorStart() is made a part of CheckCachedPlan() somehow, which would then take locks on the relation as the PlanState tree is built capturing any plan invalidations, instead of AcquireExecutorLocks(). That does sound like an ambitious undertaking though. > > Note that "initial" pruning steps are now performed twice when > > executing generic plans: once in AcquireExecutorLocks() to find > > partitions to be locked, and a 2nd time in ExecInit[Merge]Append() to > > determine the set of partition sub-nodes to be initialized for > > execution, though I wasn't able to come up with a good idea to avoid > > this duplication. > > I think this is something that will need to be fixed somehow. Apart > from the CPU cost, it's scary to imagine that the set of nodes on > which we acquired locks might be different from the set of nodes that > we initialize. If we do the same computation twice, there must be some > non-zero probability of getting a different answer the second time, > even if the circumstances under which it would actually happen are > remote. Consider, for example, a function that is labeled IMMUTABLE > but is really VOLATILE. Now maybe you can get the system to lock one > set of partitions and then initialize a different set of partitions. I > don't think we want to try to reason about what consequences that > might have and prove that somehow it's going to be OK; I think we want > to nail the door shut very tightly to make sure that it can't. Yeah, the premise of the patch is that "initial" pruning steps produce the same result both times. I assumed that would be true because the pruning steps are not allowed to contain any VOLATILE expressions. 
Regarding the possibility that IMMUTABLE labeling of functions may be incorrect, I haven't considered if the runtime pruning code can cope or whether it should try to. If such a case does occur in practice, the bad outcome would be an Assert failure in ExecGetRangeTableRelation() or using a partition unlocked in the non-assert builds, the latter of which feels especially bad. -- Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/CA%2BTgmoZN-80143F8OhN8Cn5-uDae5miLYVwMapAuc%2B7%2BZ7pyNg%40mail.gmail.com
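To illustrate the kind of mislabeling being discussed here, consider a function like the following -- a purely artificial example, not something from the patch -- used against a partitioned table such as the one in the earlier example:

    -- Declared STABLE, so "initial" pruning is allowed to evaluate it
    -- before execution, but its result actually changes from call to call.
    create function not_really_stable(p int) returns int
    language sql stable
    as $$ select p + (random() * 10)::int $$;

    prepare q2 as select * from partitioned_table where key = not_really_stable($1);

    -- Two evaluations of the pruning expression -- one in
    -- AcquireExecutorLocks(), one in ExecInit[Merge]Append() -- could now
    -- select different partitions.
    execute q2(1);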
On Wed, Jan 12, 2022 at 9:32 AM Amit Langote <amitlangote09@gmail.com> wrote: > I have pondered on the possible hazards before writing the patch, > mainly because the concerns about a previously discussed proposal were > along similar lines [1]. True. I think that the hazards are narrower with this proposal, because if you *delay* locking a partition that you eventually need, then you might end up trying to actually execute a portion of the plan that's no longer valid. That seems like hopelessly bad news. On the other hand, with this proposal, you skip locking altogether, but only for parts of the plan that you don't plan to execute. That's still kind of scary, but not to nearly the same degree. > IIUC, you're saying the plan tree is subject to inspection by non-core > code before ExecutorStart() has initialized a PlanState tree, which > must have discarded pruned portions of the plan tree. I wouldn't > claim to have scanned *all* of the core code that could possibly > access the invalidated portions of the plan tree, but from what I have > seen, I couldn't find any site that does. An ExecutorStart_hook() > gets to do that, but from what I can see it is expected to call > standard_ExecutorStart() before doing its thing and supposedly only > looks at the PlanState tree, which must be valid. Actually, EXPLAIN > also does ExecutorStart() before starting to look at the plan (the > PlanState tree), so must not run into pruned plan tree nodes. All > that said, it does sound like wishful thinking to say that no problems > can possibly occur. Yeah. I don't think it's only non-core code we need to worry about either. What if I just do EXPLAIN ANALYZE on a prepared query that ends up pruning away some stuff? IIRC, the pruned subplans are not shown, so we might escape disaster here, but FWIW if I'd committed that code I would have pushed hard for showing those and saying "(not executed)" .... so it's not too crazy to imagine a world in which things work that way. > At first, I had tried to implement this such that the > Append/MergeAppend nodes are edited to record the result of initial > pruning, but it felt wrong to be munging the plan tree in plancache.c. It is. You can't munge the plan tree: it's required to be strictly read-only once generated. It can be serialized and deserialized for transmission to workers, and it can be shared across executions. > Or, maybe this won't be a concern if performing ExecutorStart() is > made a part of CheckCachedPlan() somehow, which would then take locks > on the relation as the PlanState tree is built capturing any plan > invalidations, instead of AcquireExecutorLocks(). That does sound like > an ambitious undertaking though. On the surface that would seem to involve abstraction violations, but maybe that could be finessed somehow. The plancache shouldn't know too much about what the executor is going to do with the plan, but it could ask the executor to perform a step that has been designed for use by the plancache. I guess the core problem here is how to pass around information that is node-specific before we've stood up the executor state tree. Maybe the executor could have a function that does the pruning and returns some kind of array of results that can be used both to decide what to lock and also what to consider as pruned at the start of execution. (I'm hand-waving about the details because I don't know.) > Yeah, the premise of the patch is that "initial" pruning steps produce > the same result both times. 
I assumed that would be true because the > pruning steps are not allowed to contain any VOLATILE expressions. > Regarding the possibility that IMMUTABLE labeling of functions may be > incorrect, I haven't considered if the runtime pruning code can cope > or whether it should try to. If such a case does occur in practice, > the bad outcome would be an Assert failure in > ExecGetRangeTableRelation() or using a partition unlocked in the > non-assert builds, the latter of which feels especially bad. Right. I think it's OK for a query to produce wrong answers under those kinds of conditions - the user has broken everything and gets to keep all the pieces - but doing stuff that might violate fundamental assumptions of the system like "relations can only be accessed when holding a lock on them" feels quite bad. It's not a stretch to imagine that failing to follow those invariants could take the whole system down, which is clearly too severe a consequence for the user's failure to label things properly. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 6, 2022 at 3:45 PM Amul Sul <sulamul@gmail.com> wrote:
> Here are few comments for v1 patch:

Thanks Amul. I'm thinking about Robert's latest comments, addressing which may need some rethinking of this whole design, but I decided to post a v2 taking care of your comments.

> + /* Caller error if we get here without contains_init_steps */
> + Assert(pruneinfo->contains_init_steps);
>
> - prunedata = prunestate->partprunedata[i];
> - pprune = &prunedata->partrelprunedata[0];
>
> - /* Perform pruning without using PARAM_EXEC Params */
> - find_matching_subplans_recurse(prunedata, pprune, true, &result);
> + if (parentrelids)
> + *parentrelids = NULL;
>
> You got two blank lines after Assert.

Fixed.

> --
>
> + /* Set up EState if not in the executor proper. */
> + if (estate == NULL)
> + {
> + estate = CreateExecutorState();
> + estate->es_param_list_info = params;
> + free_estate = true;
> }
>
> ... [Skip]
>
> + if (free_estate)
> + {
> + FreeExecutorState(estate);
> + estate = NULL;
> }
>
> I think this work should be left to the caller.

Done. Also see below...

> /*
> * Stuff that follows matches exactly what ExecCreatePartitionPruneState()
> * does, except we don't need a PartitionPruneState here, so don't call
> * that function.
> *
> * XXX some refactoring might be good.
> */
>
> +1, while doing it would be nice if foreach_current_index() is used
> instead of the i & j sequence in the respective foreach() block, IMO.

Actually, I rewrote this part quite significantly so that most of the code remains in its existing place. I decided to let GetLockableRelations_walker() create a PartitionPruneState and pass that to ExecFindInitialMatchingSubPlans() that is now left more or less as is. Instead, ExecCreatePartitionPruneState() is changed to be callable from outside the executor. The temporary EState is no longer necessary. ExprContext, PartitionDirectory, etc. are now managed in the caller, GetLockableRelations_walker().

> --
>
> + while ((i = bms_next_member(validsubplans, i)) >= 0)
> + {
> + Plan *subplan = list_nth(subplans, i);
> +
> + context->relations =
> + bms_add_members(context->relations,
> + get_plan_scanrelids(subplan));
> + }
>
> I think instead of get_plan_scanrelids() the
> GetLockableRelations_worker() can be used; if so, then no need to add
> get_plan_scanrelids() function.

You're right, done.

> --
>
> /* Nodes containing prunable subnodes. */
> + case T_MergeAppend:
> + {
> + PlannedStmt *plannedstmt = context->plannedstmt;
> + List *rtable = plannedstmt->rtable;
> + ParamListInfo params = context->params;
> + PartitionPruneInfo *pruneinfo;
> + Bitmapset *validsubplans;
> + Bitmapset *parentrelids;
> ...
> if (pruneinfo && pruneinfo->contains_init_steps)
> {
> int i;
> ...
> return false;
> }
> }
> break;
>
> Most of the declarations need to be moved inside the if-block.

Done.

> Also, initially, I was a bit concerned regarding this code block
> inside GetLockableRelations_worker(), what if (pruneinfo &&
> pruneinfo->contains_init_steps) evaluated to false? After debugging I
> realized that plan_tree_walker() will do the needful -- a bit of
> comment would have helped.

You're right. Added a dummy else {} block with just the comment saying so.

> + case T_CustomScan:
> + foreach(lc, ((CustomScan *) plan)->custom_plans)
> + {
> + if (walker((Plan *) lfirst(lc), context))
> + return true;
> + }
> + break;
>
> Why not plan_walk_members() call like other nodes?

Makes sense, done.

Again, most/all of this patch might need to be thrown away, but here it is anyway.
-- Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Fri, Jan 14, 2022 at 11:10 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Thu, Jan 6, 2022 at 3:45 PM Amul Sul <sulamul@gmail.com> wrote:
> > Here are few comments for v1 patch:
>
> Thanks Amul. I'm thinking about Robert's latest comments, addressing which may need some rethinking of this whole design, but I decided to post a v2 taking care of your comments.

cfbot tells me there is an unused variable warning, which is fixed in the attached v3.

--
Amit Langote
EDB: http://www.enterprisedb.com
Attachment
On Tue, 11 Jan 2022 at 16:22, Robert Haas <robertmhaas@gmail.com> wrote: > This is just a relatively simple example and I think there are > probably a bunch of others. There are a lot of kinds of DDL that could > be performed on a partition that gets pruned away: DROP INDEX is just > one example. I haven't followed this in any detail, but this patch and its goal of reducing the O(N) drag effect on partition execution time is very important. Locking a long list of objects that then get pruned is very wasteful, as the results show. Ideally, we want an O(1) algorithm for single partition access and DDL is rare. So perhaps that is the starting point for a safe design - invent a single lock or cache that allows us to check if the partition hierarchy has changed in any way, and if so, replan, if not, skip locks. Please excuse me if this idea falls short, if so, please just note my comment about how important this is. Thanks. -- Simon Riggs http://www.EnterpriseDB.com/
Hi Simon, On Tue, Jan 18, 2022 at 4:44 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > On Tue, 11 Jan 2022 at 16:22, Robert Haas <robertmhaas@gmail.com> wrote: > > This is just a relatively simple example and I think there are > > probably a bunch of others. There are a lot of kinds of DDL that could > > be performed on a partition that gets pruned away: DROP INDEX is just > > one example. > > I haven't followed this in any detail, but this patch and its goal of > reducing the O(N) drag effect on partition execution time is very > important. Locking a long list of objects that then get pruned is very > wasteful, as the results show. > > Ideally, we want an O(1) algorithm for single partition access and DDL > is rare. So perhaps that is the starting point for a safe design - > invent a single lock or cache that allows us to check if the partition > hierarchy has changed in any way, and if so, replan, if not, skip > locks. Rearchitecting partition locking to be O(1) seems like a project of non-trivial complexity as Robert mentioned in a related email thread couple of years ago: https://www.postgresql.org/message-id/CA%2BTgmoYbtm1uuDne3rRp_uNA2RFiBwXX1ngj3RSLxOfc3oS7cQ%40mail.gmail.com Pursuing that kind of a project would perhaps have been more worthwhile if the locking issue had affected more than just this particular case, that is, the case of running prepared statements over partitioned tables using generic plans. Addressing this by rearchitecting run-time pruning (and plancache to some degree) seemed like it might lead to this getting fixed in a bounded timeframe. I admit that the concerns that Robert has raised about the patch make me want to reconsider that position, though maybe it's too soon to conclude. -- Amit Langote EDB: http://www.enterprisedb.com
On Tue, 18 Jan 2022 at 08:10, Amit Langote <amitlangote09@gmail.com> wrote: > > Hi Simon, > > On Tue, Jan 18, 2022 at 4:44 PM Simon Riggs > <simon.riggs@enterprisedb.com> wrote: > > On Tue, 11 Jan 2022 at 16:22, Robert Haas <robertmhaas@gmail.com> wrote: > > > This is just a relatively simple example and I think there are > > > probably a bunch of others. There are a lot of kinds of DDL that could > > > be performed on a partition that gets pruned away: DROP INDEX is just > > > one example. > > > > I haven't followed this in any detail, but this patch and its goal of > > reducing the O(N) drag effect on partition execution time is very > > important. Locking a long list of objects that then get pruned is very > > wasteful, as the results show. > > > > Ideally, we want an O(1) algorithm for single partition access and DDL > > is rare. So perhaps that is the starting point for a safe design - > > invent a single lock or cache that allows us to check if the partition > > hierarchy has changed in any way, and if so, replan, if not, skip > > locks. > > Rearchitecting partition locking to be O(1) seems like a project of > non-trivial complexity as Robert mentioned in a related email thread > couple of years ago: > > https://www.postgresql.org/message-id/CA%2BTgmoYbtm1uuDne3rRp_uNA2RFiBwXX1ngj3RSLxOfc3oS7cQ%40mail.gmail.com I agree, completely redesigning locking is a major project. But that isn't what I suggested, which was to find an O(1) algorithm to solve the safety issue. I'm sure there is an easy way to check one lock, maybe a new one/new kind, rather than N. Why does the safety issue exist? Why is it important to be able to concurrently access parts of the hierarchy with DDL? Those are not critical points. If we asked them, most users would trade a 10x performance gain for some restrictions on DDL. If anyone cares, make it an option, but most people will use it. Maybe force all DDL, or just DDL that would cause safety issues, to update a hierarchy version number, so queries can tell whether they need to replan. Don't know, just looking for an O(1) solution. -- Simon Riggs http://www.EnterpriseDB.com/
On Tue, Jan 18, 2022 at 3:10 AM Amit Langote <amitlangote09@gmail.com> wrote: > Pursuing that kind of a project would perhaps have been more > worthwhile if the locking issue had affected more than just this > particular case, that is, the case of running prepared statements over > partitioned tables using generic plans. Addressing this by > rearchitecting run-time pruning (and plancache to some degree) seemed > like it might lead to this getting fixed in a bounded timeframe. I > admit that the concerns that Robert has raised about the patch make me > want to reconsider that position, though maybe it's too soon to > conclude. I wasn't trying to say that your approach was dead in the water. It does create a situation that can't happen today, and such things are scary and need careful thought. But redesigning the locking mechanism would need careful thought, too ... maybe even more of it than sorting this out. I do also agree with Simon that this is an important problem to which we need to find some solution. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 18, 2022 at 7:28 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > On Tue, 18 Jan 2022 at 08:10, Amit Langote <amitlangote09@gmail.com> wrote: > > On Tue, Jan 18, 2022 at 4:44 PM Simon Riggs > > <simon.riggs@enterprisedb.com> wrote: > > > I haven't followed this in any detail, but this patch and its goal of > > > reducing the O(N) drag effect on partition execution time is very > > > important. Locking a long list of objects that then get pruned is very > > > wasteful, as the results show. > > > > > > Ideally, we want an O(1) algorithm for single partition access and DDL > > > is rare. So perhaps that is the starting point for a safe design - > > > invent a single lock or cache that allows us to check if the partition > > > hierarchy has changed in any way, and if so, replan, if not, skip > > > locks. > > > > Rearchitecting partition locking to be O(1) seems like a project of > > non-trivial complexity as Robert mentioned in a related email thread > > couple of years ago: > > > > https://www.postgresql.org/message-id/CA%2BTgmoYbtm1uuDne3rRp_uNA2RFiBwXX1ngj3RSLxOfc3oS7cQ%40mail.gmail.com > > I agree, completely redesigning locking is a major project. But that > isn't what I suggested, which was to find an O(1) algorithm to solve > the safety issue. I'm sure there is an easy way to check one lock, > maybe a new one/new kind, rather than N. I misread your email then, sorry. > Why does the safety issue exist? Why is it important to be able to > concurrently access parts of the hierarchy with DDL? Those are not > critical points. > > If we asked them, most users would trade a 10x performance gain for > some restrictions on DDL. If anyone cares, make it an option, but most > people will use it. > > Maybe force all DDL, or just DDL that would cause safety issues, to > update a hierarchy version number, so queries can tell whether they > need to replan. Don't know, just looking for an O(1) solution. Yeah, it would be great if it would suffice to take a single lock on the partitioned table mentioned in the query, rather than on all elements of the partition tree added to the plan. AFAICS, ways to get that are 1) Prevent modifying non-root partition tree elements, 2) Make it so that locking a partitioned table becomes a proxy for having locked all of its descendents, 3) Invent a Plan representation for scanning partitioned tables such that adding the descendent tables that survive plan-time pruning to the plan doesn't require locking them too. IIUC, you've mentioned 1 and 2. I think I've seen 3 mentioned in the past discussions on this topic, but I guess the research on whether that's doable has never been done. -- Amit Langote EDB: http://www.enterprisedb.com
On Tue, Jan 18, 2022 at 11:53 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jan 18, 2022 at 3:10 AM Amit Langote <amitlangote09@gmail.com> wrote: > > Pursuing that kind of a project would perhaps have been more > > worthwhile if the locking issue had affected more than just this > > particular case, that is, the case of running prepared statements over > > partitioned tables using generic plans. Addressing this by > > rearchitecting run-time pruning (and plancache to some degree) seemed > > like it might lead to this getting fixed in a bounded timeframe. I > > admit that the concerns that Robert has raised about the patch make me > > want to reconsider that position, though maybe it's too soon to > > conclude. > > I wasn't trying to say that your approach was dead in the water. It > does create a situation that can't happen today, and such things are > scary and need careful thought. But redesigning the locking mechanism > would need careful thought, too ... maybe even more of it than sorting > this out. Yes, agreed. -- Amit Langote EDB: http://www.enterprisedb.com
On Wed, 19 Jan 2022 at 08:31, Amit Langote <amitlangote09@gmail.com> wrote: > > Maybe force all DDL, or just DDL that would cause safety issues, to > > update a hierarchy version number, so queries can tell whether they > > need to replan. Don't know, just looking for an O(1) solution. > > Yeah, it would be great if it would suffice to take a single lock on > the partitioned table mentioned in the query, rather than on all > elements of the partition tree added to the plan. AFAICS, ways to get > that are 1) Prevent modifying non-root partition tree elements, Can we reuse the concept of Strong/Weak locking here? When a DDL request is in progress (for that partitioned table), take all required locks for safety. When a DDL request is not in progress, take minimal locks knowing it is safe. We can take a single PartitionTreeModificationLock, nowait to prove that we do not need all locks. DDL would request the lock in exclusive mode. (Other mechanisms possible). -- Simon Riggs http://www.EnterpriseDB.com/
On Thu, Jan 13, 2022 at 3:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 12, 2022 at 9:32 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Or, maybe this won't be a concern if performing ExecutorStart() is made a part of CheckCachedPlan() somehow, which would then take locks on the relation as the PlanState tree is built capturing any plan invalidations, instead of AcquireExecutorLocks(). That does sound like an ambitious undertaking though.
>
> On the surface that would seem to involve abstraction violations, but maybe that could be finessed somehow. The plancache shouldn't know too much about what the executor is going to do with the plan, but it could ask the executor to perform a step that has been designed for use by the plancache. I guess the core problem here is how to pass around information that is node-specific before we've stood up the executor state tree. Maybe the executor could have a function that does the pruning and returns some kind of array of results that can be used both to decide what to lock and also what to consider as pruned at the start of execution. (I'm hand-waving about the details because I don't know.)

The attached patch implements this idea. Sorry for the delay in getting this out and thanks to Robert for the off-list discussions on this.

So the new executor "step" you mention is the function ExecutorPrep in the patch, which calls a recursive function ExecPrepNode on the plan tree's top node, much as ExecutorStart calls (via InitPlan) ExecInitNode to construct a PlanState tree for actual execution paralleling the plan tree. For now, ExecutorPrep() / ExecPrepNode() does mainly two things if and as it walks the plan tree: 1) extract the RT indexes of RTE_RELATION entries and add them to a bitmapset in the result struct, 2) if the node contains a PartitionPruneInfo, perform its "initial pruning steps" and store the result of doing so in a per-plan-node node called PlanPrepOutput.

The bitmapset and the array containing per-plan-node PlanPrepOutput nodes are returned in a node called ExecPrepOutput, which is the result of ExecutorPrep, to its calling module (say, plancache.c), which, after it's done using that information, must pass it forward to subsequent execution steps. That is done by passing it, via the module's callers, to CreateQueryDesc(), which remembers the ExecPrepOutput in the QueryDesc that is eventually passed to ExecutorStart().

A bunch of other details are mentioned in the patch's commit message, which I'm pasting below for anyone reading to spot any obvious flaws (no-go's) of this approach:

    Invent a new executor "prep" phase

    The new phase, implemented by execMain.c:ExecutorPrep() and its recursive underling execProcnode.c:ExecPrepNode(), takes a query's PlannedStmt and processes the plan tree contained in it to produce an ExecPrepOutput node as result. As the plan tree is walked, each node must add the RT index(es) of any relation(s) that it directly manipulates to a bitmapset member of ExecPrepOutput (for example, an IndexScan node must add the Scan's scanrelid). Also, each node may want to make a PlanPrepOutput node containing additional information that may be of interest to the calling module or to the later execution phases, if the node can provide one (for example, an Append node may perform initial pruning and add a set of "initially valid subplans" to the PlanPrepOutput).

    The PlanPrepOutput nodes of all the plan nodes are added to an array in the ExecPrepOutput, which is indexed using the individual nodes' plan_node_id; a NULL is stored in the array slots of nodes that don't have anything interesting to add to the PlanPrepOutput. The ExecPrepOutput thus produced is passed to CreateQueryDesc() and subsequently to ExecutorStart() via QueryDesc, which then makes it available to the executor routines via the query's EState.

    The main goal of adding this new phase is, for now, to allow cached generic plans containing scans of partitioned tables using Append/MergeAppend to be executed more efficiently by the prep phase doing any initial pruning, instead of deferring that to ExecutorStart(). That may allow AcquireExecutorLocks() on the plan to lock only the minimal set of relations/partitions, that is those whose subplans survive the initial pruning.

    Implementation notes:

    * To allow initial pruning to be done as part of the pre-execution prep phase as opposed to as part of ExecutorStart(), this refactors ExecCreatePartitionPruneState() and ExecFindInitialMatchingSubPlans() to pass the information needed to do initial pruning directly as parameters instead of getting that from the EState and the PlanState of the parent Append/MergeAppend, both of which would not be available in ExecutorPrep(). Another, sort of non-essential-to-this-goal, refactoring this does is moving the partition pruning initialization stanzas in ExecInitAppend() and ExecInitMergeAppend(), both of which contain the same code, into its own function ExecInitPartitionPruning().

    * To pass the ExecPrepOutput(s) created by the plancache module's invocation of ExecutorPrep() to the callers of the module, which in turn would pass them down to ExecutorStart(), CachedPlan gets a new List field that stores those ExecPrepOutputs, containing one element for each PlannedStmt also contained in the CachedPlan. The new list is stored in a child context of the context containing the PlannedStmts, though unlike the latter, it is reset on every invocation of CheckCachedPlan(), which in turn calls ExecutorPrep() with a new set of bound Params.

    * AcquireExecutorLocks() is now made to loop over a bitmapset of RT indexes, those of relations returned in ExecPrepOutput, instead of over the whole range table. With initial pruning that is also done as part of ExecutorPrep(), only relations from non-pruned nodes of the plan tree would get locked as a result of this new arrangement.

    * PlannedStmt gets a new field usesPrepExecPruning that indicates whether any of the nodes of the plan tree contain "initial" (or "pre-execution") pruning steps, which saves ExecutorPrep() the trouble of walking the plan tree only to find out whether that's the case.

    * PartitionPruneInfo nodes now explicitly store whether the steps contained in any of the individual PartitionedRelPruneInfos embedded in it contain initial pruning steps (those that can be performed during ExecutorPrep) and execution pruning steps (those that can only be performed during ExecutorRun), as flags contains_initial_steps and contains_exec_steps, respectively. In fact, the aforementioned PlannedStmt field's value is a logical OR of the values of the former across all PartitionPruneInfo nodes embedded in the plan tree.

    * PlannedStmt also gets a bitmapset field to store the RT indexes of all relation RTEs referenced in the query that is populated when constructing the flat range table in setrefs.c, which effectively contains all the relations that the planner must have locked. In the case of a cached plan, AcquireExecutorLocks() must lock all of those relations, except those whose subnodes get pruned as a result of ExecutorPrep().

    * PlannedStmt gets yet another field numPlanNodes that records the highest plan_node_id assigned to any of the nodes contained in the tree, which serves as the size to use when allocating the PlanPrepOutput array.

Maybe this should be more than one patch? Say:

0001 to add ExecutorPrep and the boilerplate,
0002 to teach plancache.c to use the new facility

Thoughts?

--
Amit Langote
EDB: http://www.enterprisedb.com
Attachment
On Thu, Feb 10, 2022 at 3:14 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Maybe this should be more than one patch? Say:
>
> 0001 to add ExecutorPrep and the boilerplate,
> 0002 to teach plancache.c to use the new facility

Could be, not sure. I agree that if it's possible to split this in a meaningful way, it would facilitate review.

I notice that there is some straight code movement e.g. the creation of ExecPartitionPruneFixSubPlanIndexes. It would be best, I think, to do pure code movement in a preparatory patch so that the main patch is just adding the new stuff we need and not moving stuff around.

David Rowley recently proposed a patch for some parallel-safety debugging cross checks which added a plan tree walker. I'm not sure whether he's going to press that patch forward to commit, but I think we should get something like that into the tree and start using it, rather than adding more bespoke code. Maybe you/we should steal that part of his patch and commit it separately. What I'm imagining is that plan_tree_walker() would know which nodes have subnodes and how to recurse over the tree structure, and you'd have a walker function to use with it that would know which executor nodes have ExecPrep functions and call them, and just do nothing for the others. That would spare you adding stub functions for nodes that don't need to do anything, or don't need to do anything other than recurse. Admittedly it would look a bit different from the existing executor phases, but I'd argue that it's a better coding model. Actually, you might've had this in the patch at some point, because you have a declaration for plan_tree_walker but no implementation.

I guess one thing that's a bit awkward about this idea is that in some cases you want to recurse to some subnodes but not other subnodes. But maybe it would work to put the recursion in the walker function in that case, and then just return true; but if you want to walk all children, return false.

+ bool contains_init_steps;
+ bool contains_exec_steps;

s/steps/pruning/? maybe with contains -> needs or performs or requires as well?

+ * Returned information includes the set of RT indexes of relations referenced
+ * in the plan, and a PlanPrepOutput node for each node in the planTree if the
+ * node type supports producing one.

Aren't all RT indexes referenced in the plan?

+ * This may lock relations whose information may be used to produce the
+ * PlanPrepOutput nodes. For example, a partitioned table before perusing its
+ * PartitionPruneInfo contained in an Append node to do the pruning the result
+ * of which is used to populate the Append node's PlanPrepOutput.

"may lock" feels awfully fuzzy to me. How am I supposed to rely on something that "may" happen? And don't we need to have tight logic around locking, with specific guarantees about what is locked at which points in the code and what is not?

+ * At least one of 'planstate' or 'econtext' must be passed to be able to
+ * successfully evaluate any non-Const expressions contained in the
+ * steps.

This also seems fuzzy. If I'm thinking of calling this function, I don't know how I'd know whether this criterion is met.

I don't love PlanPrepOutput the way you have it. I think one of the basic design issues for this patch is: should we think of the prep phase as specifically pruning, or is it general prep and pruning is the first thing for which we're going to use it? If it's really a pre-pruning phase, we could name it that way instead of calling it "prep". If it's really a general prep phase, then why does PlanPrepOutput contain initially_valid_subnodes as a field? One could imagine letting each prep function decide what kind of prep node it would like to return, with partition pruning being just one of the options. But is that a useful generalization of the basic concept, or just pretending that a special-purpose mechanism is more general than it really is?

+ return CreateQueryDesc(pstmt, NULL, /* XXX pass ExecPrepOutput too? */

It seems to me that we should do what the XXX suggests. It doesn't seem nice if the parallel workers could theoretically decide to prune a different set of nodes than the leader.

+ * known at executor startup (excludeing expressions containing

Extra e.

+ * into subplan indexes, is also returned for use during subsquent

Missing e.

Somewhere, we're going to need to document the idea that this may permit us to execute a plan that isn't actually fully valid, but that we expect to survive because we'll never do anything with the parts of it that aren't. Maybe that should be added to the executor README, or maybe there's some better place, but I don't think that should remain something that's just implicit.

This is not a full review, just some initial thoughts looking through this.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2022-02-10 17:13:52 +0900, Amit Langote wrote:
> The attached patch implements this idea. Sorry for the delay in getting this out and thanks to Robert for the off-list discussions on this.

I did not follow this thread at all. And I only skimmed the patch. So I'm probably wrong.

I'm wary of this increasing executor overhead even in cases it won't help. Without this patch, for simple queries, I see small allocations noticeably in profiles. This adds a bunch more, even if !context->stmt->usesPreExecPruning:

- makeNode(ExecPrepContext)
- makeNode(ExecPrepOutput)
- palloc0(sizeof(PlanPrepOutput *) * result->numPlanNodes)
- stmt_execprep_list = lappend(stmt_execprep_list, execprep);
- AllocSetContextCreate(CurrentMemoryContext, "CachedPlan execprep list", ...
- ...

That's a lot of extra for something that's already a bottleneck.

Greetings,

Andres Freund
(just catching up on this thread) On Thu, 13 Jan 2022 at 07:20, Robert Haas <robertmhaas@gmail.com> wrote: > Yeah. I don't think it's only non-core code we need to worry about > either. What if I just do EXPLAIN ANALYZE on a prepared query that > ends up pruning away some stuff? IIRC, the pruned subplans are not > shown, so we might escape disaster here, but FWIW if I'd committed > that code I would have pushed hard for showing those and saying "(not > executed)" .... so it's not too crazy to imagine a world in which > things work that way. FWIW, that would remove the whole point in init run-time pruning. The reason I made two phases of run-time pruning was so that we could get away from having the init plan overhead of nodes we'll never need to scan. If we wanted to show the (never executed) scans in EXPLAIN then we'd need to do the init plan part and allocate all that memory needlessly. Imagine a hash partitioned table on "id" with 1000 partitions. The user does: PREPARE q1 (INT) AS SELECT * FROM parttab WHERE id = $1; EXECUTE q1(123); Assuming a generic plan, if we didn't have init pruning then we have to build a plan containing the scans for all 1000 partitions. There's significant overhead to that compared to just locking the partitions, and initialising 1 scan. If it worked this way then we'd be even further from Amit's goal of reducing the overhead of starting plan with run-time pruning nodes. I understood at the time it was just the EXPLAIN output that you had concerns with. I thought that was just around the lack of any display of the condition we used for pruning. David
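For readers following along, this is roughly how the two-phase scheme looks in ExecInitAppend() today (paraphrased and trimmed from the source, so details may differ slightly across versions): only when the PartitionPruneState says initial pruning is possible is the surviving set computed before any executor state is allocated for the children.

    /* Inside ExecInitAppend(), roughly: */
    if (node->part_prune_info != NULL)
    {
        PartitionPruneState *prunestate;

        prunestate = ExecCreatePartitionPruneState(&appendstate->ps,
                                                   node->part_prune_info);
        appendstate->as_prune_state = prunestate;

        if (prunestate->do_initial_prune)
        {
            /* Evaluate only the steps that need no per-tuple context */
            validsubplans = ExecFindInitialMatchingSubPlans(prunestate,
                                                            list_length(node->appendplans));
            nplans = bms_num_members(validsubplans);
        }
        else
        {
            /* No initial steps; every subplan must be initialized */
            nplans = list_length(node->appendplans);
            validsubplans = bms_add_range(NULL, 0, nplans - 1);
        }
    }

Only the subplans in validsubplans then get ExecInitNode() called on them, which is the startup saving David describes.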
On Sun, Feb 13, 2022 at 4:55 PM David Rowley <dgrowleyml@gmail.com> wrote: > FWIW, that would remove the whole point in init run-time pruning. The > reason I made two phases of run-time pruning was so that we could get > away from having the init plan overhead of nodes we'll never need to > scan. If we wanted to show the (never executed) scans in EXPLAIN then > we'd need to do the init plan part and allocate all that memory > needlessly. Interesting. I didn't realize that was why it had ended up like this. > I understood at the time it was just the EXPLAIN output that you had > concerns with. I thought that was just around the lack of any display > of the condition we used for pruning. That was part of it, but I did think it was surprising that we didn't print anything at all about the nodes we pruned, too. Although we're technically iterating over the PlanState, from the user perspective it feels like you're asking PostgreSQL to print out the plan - so it seems weird to have nodes in the Plan tree that are quietly omitted from the output. That said, perhaps in retrospect it's good that it ended up as it did, since we'd have a lot of trouble printing anything sensible for a scan of a table that's since been dropped. -- Robert Haas EDB: http://www.enterprisedb.com
Hi Andres, On Fri, Feb 11, 2022 at 10:29 AM Andres Freund <andres@anarazel.de> wrote: > On 2022-02-10 17:13:52 +0900, Amit Langote wrote: > > The attached patch implements this idea. Sorry for the delay in > > getting this out and thanks to Robert for the off-list discussions on > > this. > > I did not follow this thread at all. And I only skimmed the patch. So I'm > probably wrong. Thanks for your interest in this and sorry about the delay in replying (have been away due to illness). > I'm a wary of this increasing executor overhead even in cases it won't > help. Without this patch, for simple queries, I see small allocations > noticeably in profiles. This adds a bunch more, even if > !context->stmt->usesPreExecPruning: Ah, if any new stuff added by the patch runs in !context->stmt->usesPreExecPruning paths, then it's just poor coding on my part, which I'm now looking to fix. Maybe not all of it is avoidable, but I think whatever isn't should be trivial... > - makeNode(ExecPrepContext) > - makeNode(ExecPrepOutput) > - palloc0(sizeof(PlanPrepOutput *) * result->numPlanNodes) > - stmt_execprep_list = lappend(stmt_execprep_list, execprep); > - AllocSetContextCreate(CurrentMemoryContext, > "CachedPlan execprep list", ... > - ... > > That's a lot of extra for something that's already a bottleneck. If all these allocations are limited to the usesPreExecPruning path, IMO, they would amount to trivial overhead compared to what is going to be avoided -- locking say 1000 partitions when only 1 will be scanned. Although, maybe there's a way to code this to have even less overhead than what's in the patch now. -- Amit Langote EDB: http://www.enterprisedb.com
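Presumably the fix being described amounts to gating all of the new work behind the plan-level flag, along these lines; usesPreExecPruning and the ExecPrep* names come from the patch, AcquireLocksForCachedPlan() and ExecutorPrepAndLock() are invented for the sketch, and AcquireExecutorLocks() is the existing plancache.c routine.

    /* Hypothetical shape of the fix (sketch only): */
    static void
    AcquireLocksForCachedPlan(PlannedStmt *stmt, List *stmt_list,
                              ParamListInfo params)
    {
        if (!stmt->usesPreExecPruning)
        {
            /* No initial pruning in this plan: old cheap path, no new allocations */
            AcquireExecutorLocks(stmt_list, true);
            return;
        }

        /*
         * Only plans that can actually skip some locks pay for the prep
         * machinery (ExecPrepContext, ExecPrepOutput, the per-node array, ...).
         */
        ExecutorPrepAndLock(stmt, params);  /* made-up name for the new path */
    }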
On Fri, Feb 11, 2022 at 7:02 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Feb 10, 2022 at 3:14 AM Amit Langote <amitlangote09@gmail.com> wrote: > > Maybe this should be more than one patch? Say: > > > > 0001 to add ExecutorPrep and the boilerplate, > > 0002 to teach plancache.c to use the new facility Thanks for taking a look and sorry about the delay. > Could be, not sure. I agree that if it's possible to split this in a > meaningful way, it would facilitate review. I notice that there is > some straight code movement e.g. the creation of > ExecPartitionPruneFixSubPlanIndexes. It would be best, I think, to do > pure code movement in a preparatory patch so that the main patch is > just adding the new stuff we need and not moving stuff around. Okay, created 0001 for moving around the execution pruning code. > David Rowley recently proposed a patch for some parallel-safety > debugging cross checks which added a plan tree walker. I'm not sure > whether he's going to press that patch forward to commit, but I think > we should get something like that into the tree and start using it, > rather than adding more bespoke code. Maybe you/we should steal that > part of his patch and commit it separately. I looked at the thread you mentioned (I guess [1]), though it seems David's proposing a path_tree_walker(), so I guess only useful within the planner and not here. > What I'm imagining is that > plan_tree_walker() would know which nodes have subnodes and how to > recurse over the tree structure, and you'd have a walker function to > use with it that would know which executor nodes have ExecPrep > functions and call them, and just do nothing for the others. That > would spare you adding stub functions for nodes that don't need to do > anything, or don't need to do anything other than recurse. Admittedly > it would look a bit different from the existing executor phases, but > I'd argue that it's a better coding model. > > Actually, you might've had this in the patch at some point, because > you have a declaration for plan_tree_walker but no implementation. Right, the previous patch indeed used a plan_tree_walker() for this and I think in a way you seem to think it should work. I do agree that plan_tree_walker() allows for a better implementation of the idea of this patch and may also be generally useful, so I've created a separate patch that adds it to nodeFuncs.c. > I guess one thing that's a bit awkward about this idea is that in some > cases you want to recurse to some subnodes but not other subnodes. But > maybe it would work to put the recursion in the walker function in > that case, and then just return true; but if you want to walk all > children, return false. Right, that's how I've made ExecPrepAppend() etc. do it. > + bool contains_init_steps; > + bool contains_exec_steps; > > s/steps/pruning/? maybe with contains -> needs or performs or requires as well? Went with: needs_{init|exec}_pruning > + * Returned information includes the set of RT indexes of relations referenced > + * in the plan, and a PlanPrepOutput node for each node in the planTree if the > + * node type supports producing one. > > Aren't all RT indexes referenced in the plan? Ah yes. How about: * Returned information includes the set of RT indexes of relations that must * be locked to safely execute the plan, > + * This may lock relations whose information may be used to produce the > + * PlanPrepOutput nodes. 
For example, a partitioned table before perusing its > + * PartitionPruneInfo contained in an Append node to do the pruning the result > + * of which is used to populate the Append node's PlanPrepOutput. > > "may lock" feels awfully fuzzy to me. How am I supposed to rely on > something that "may" happen? And don't we need to have tight logic > around locking, with specific guarantees about what is locked at which > points in the code and what is not? Agree the wording was fuzzy. I've rewritten it as: * This locks relations whose information is needed to produce the * PlanPrepOutput nodes. For example, a partitioned table before perusing its * PartitionedRelPruneInfo contained in an Append node to do the pruning, the * result of which is used to populate the Append node's PlanPrepOutput. BTW, I've added an Assert in ExecGetRangeTableRelation(): /* * A cross-check that AcquireExecutorLocks() hasn't missed any relations * it must not have. */ Assert(estate->es_execprep == NULL || bms_is_member(rti, estate->es_execprep->relationRTIs)); which IOW ensures that the actual execution of a plan only sees relations that ExecutorPrep() would've told AcquireExecutorLocks() to take a lock on. > + * At least one of 'planstate' or 'econtext' must be passed to be able to > + * successfully evaluate any non-Const expressions contained in the > + * steps. > > This also seems fuzzy. If I'm thinking of calling this function, I > don't know how I'd know whether this criterion is met. OK, I have removed this comment (which was on top of a static local function) in favor of adding some commentary on this in places where it belongs. For example, in ExecPrepDoInitialPruning(): /* * We don't yet have a PlanState for the parent plan node, so must create * a standalone ExprContext to evaluate pruning expressions, equipped with * the information about the EXTERN parameters that the caller passed us. * Note that that's okay because the initial pruning steps does not * involve anything that requires the execution to have started. */ econtext = CreateStandaloneExprContext(); econtext->ecxt_param_list_info = params; prunestate = ExecCreatePartitionPruneState(NULL, pruneinfo, true, false, rtable, econtext, pdir, parentrelids); > I don't love PlanPrepOutput the way you have it. I think one of the > basic design issues for this patch is: should we think of the prep > phase as specifically pruning, or is it general prep and pruning is > the first thing for which we're going to use it? If it's really a > pre-pruning phase, we could name it that way instead of calling it > "prep". If it's really a general prep phase, then why does > PlanPrepOutput contain initially_valid_subnodes as a field? One could > imagine letting each prep function decide what kind of prep node it > would like to return, with partition pruning being just one of the > options. But is that a useful generalization of the basic concept, or > just pretending that a special-purpose mechanism is more general than > it really is? While it can feel like the latter TBH, I'm inclined to keep ExecutorPrep generalized. What bothers me about the alternative of calling the new phase something less generalized like ExecutorDoInitPruning() is that that makes the somewhat elaborate API changes needed for the phase's output to be put into QueryDesc, through which it ultimately reaches the main executor, seem less worthwhile.
I agree that PlanPrepOutput design needs to be likewise generalized, maybe like you suggest -- using PlanInitPruningOutput, a child class of PlanPrepOutput, to return the prep output for plan nodes that support pruning. Thoughts? > + return CreateQueryDesc(pstmt, NULL, /* XXX pass ExecPrepOutput too? */ > > It seems to me that we should do what the XXX suggests. It doesn't > seem nice if the parallel workers could theoretically decide to prune > a different set of nodes than the leader. OK, will fix. > + * known at executor startup (excludeing expressions containing > > Extra e. > > + * into subplan indexes, is also returned for use during subsquent > > Missing e. Will fix. > Somewhere, we're going to need to document the idea that this may > permit us to execute a plan that isn't actually fully valid, but that > we expect to survive because we'll never do anything with the parts of > it that aren't. Maybe that should be added to the executor README, or > maybe there's some better place, but I don't think that should remain > something that's just implicit. Agreed. I'd added a description of the new prep phase to executor README, though the text didn't mention this particular bit. Will fix to mention it. > This is not a full review, just some initial thoughts looking through this. Thanks again. Will post a new version soon after a bit more polishing. -- Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/flat/b59605fecb20ba9ea94e70ab60098c237c870628.camel%40postgrespro.ru
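The "child class" arrangement mentioned here would presumably follow the usual PostgreSQL node convention of embedding the parent struct as the first member, so a PlanInitPruningOutput pointer can be used wherever a PlanPrepOutput pointer is expected; the type names below are the patch's, while the fields are only illustrative.

    typedef struct PlanPrepOutput
    {
        NodeTag     type;
        int         plan_node_id;       /* which plan node this belongs to */
    } PlanPrepOutput;

    typedef struct PlanInitPruningOutput
    {
        PlanPrepOutput base;            /* "parent class" must come first */
        Bitmapset  *initially_valid_subnodes;   /* surviving subplan indexes */
    } PlanInitPruningOutput;

That keeps ExecutorPrep()'s return type generic while letting pruning-capable nodes hand back the extra bitmapset they need.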
On Mon, Mar 7, 2022 at 11:18 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Fri, Feb 11, 2022 at 7:02 AM Robert Haas <robertmhaas@gmail.com> wrote: > > I don't love PlanPrepOutput the way you have it. I think one of the > > basic design issues for this patch is: should we think of the prep > > phase as specifically pruning, or is it general prep and pruning is > > the first thing for which we're going to use it? If it's really a > > pre-pruning phase, we could name it that way instead of calling it > > "prep". If it's really a general prep phase, then why does > > PlanPrepOutput contain initially_valid_subnodes as a field? One could > > imagine letting each prep function decide what kind of prep node it > > would like to return, with partition pruning being just one of the > > options. But is that a useful generalization of the basic concept, or > > just pretending that a special-purpose mechanism is more general than > > it really is? > > While it can feel like the latter TBH, I'm inclined to keep > ExecutorPrep generalized. What bothers me about about the > alternative of calling the new phase something less generalized like > ExecutorDoInitPruning() is that that makes the somewhat elaborate API > changes needed for the phase's output to put into QueryDesc, through > which it ultimately reaches the main executor, seem less worthwhile. > > I agree that PlanPrepOutput design needs to be likewise generalized, > maybe like you suggest -- using PlanInitPruningOutput, a child class > of PlanPrepOutput, to return the prep output for plan nodes that > support pruning. > > Thoughts? So I decided to agree with you after all about limiting the scope of this new executor interface, or IOW call it what it is. I have named it ExecutorGetLockRels() to go with the only use case we know for it -- get the set of relations for AcquireExecutorLocks() to lock to validate a plan tree. Its result returned in a node named ExecLockRelsInfo, which contains the set of relations scanned in the plan tree (lockrels) and a list of PlanInitPruningOutput nodes for all nodes that undergo pruning. > > + return CreateQueryDesc(pstmt, NULL, /* XXX pass ExecPrepOutput too? */ > > > > It seems to me that we should do what the XXX suggests. It doesn't > > seem nice if the parallel workers could theoretically decide to prune > > a different set of nodes than the leader. > > OK, will fix. Done. This required adding nodeToString() and stringToNode() support for the nodes produced by the new executor function that wasn't there before. > > Somewhere, we're going to need to document the idea that this may > > permit us to execute a plan that isn't actually fully valid, but that > > we expect to survive because we'll never do anything with the parts of > > it that aren't. Maybe that should be added to the executor README, or > > maybe there's some better place, but I don't think that should remain > > something that's just implicit. > > Agreed. I'd added a description of the new prep phase to executor > README, though the text didn't mention this particular bit. Will fix > to mention it. Rewrote the comments above ExecutorGetLockRels() (previously ExecutorPrep()) and the executor README text to be explicit about the fact that not locking some relations effectively invalidates pruned parts of the plan tree. > > This is not a full review, just some initial thoughts looking through this. > > Thanks again. Will post a new version soon after a bit more polishing. 
Attached is v5, now broken into 3 patches: 0001: Some refactoring of runtime pruning code 0002: Add a plan_tree_walker 0003: Teach AcquireExecutorLocks to skip locking pruned relations -- Amit Langote EDB: http://www.enterprisedb.com
On Fri, Mar 11, 2022 at 11:35 PM Amit Langote <amitlangote09@gmail.com> wrote: > Attached is v5, now broken into 3 patches: > > 0001: Some refactoring of runtime pruning code > 0002: Add a plan_tree_walker > 0003: Teach AcquireExecutorLocks to skip locking pruned relations Repeated the performance tests described in the 1st email of this thread: HEAD: (copied from the 1st email) 32 tps = 20561.776403 (without initial connection time) 64 tps = 12553.131423 (without initial connection time) 128 tps = 13330.365696 (without initial connection time) 256 tps = 8605.723120 (without initial connection time) 512 tps = 4435.951139 (without initial connection time) 1024 tps = 2346.902973 (without initial connection time) 2048 tps = 1334.680971 (without initial connection time) Patched v1: (copied from the 1st email) 32 tps = 27554.156077 (without initial connection time) 64 tps = 27531.161310 (without initial connection time) 128 tps = 27138.305677 (without initial connection time) 256 tps = 25825.467724 (without initial connection time) 512 tps = 19864.386305 (without initial connection time) 1024 tps = 18742.668944 (without initial connection time) 2048 tps = 16312.412704 (without initial connection time) Patched v5: 32 tps = 28204.197738 (without initial connection time) 64 tps = 26795.385318 (without initial connection time) 128 tps = 26387.920550 (without initial connection time) 256 tps = 25601.141556 (without initial connection time) 512 tps = 19911.947502 (without initial connection time) 1024 tps = 20158.692952 (without initial connection time) 2048 tps = 16180.195463 (without initial connection time) Good to see that these rewrites haven't really hurt the numbers much, which makes sense because the rewrites have really been about putting the code in the right place. BTW, these are the numbers for the same benchmark repeated with plan_cache_mode = auto, which causes a custom plan to be chosen for every execution and so unaffected by this patch. 32 tps = 13359.225082 (without initial connection time) 64 tps = 15760.533280 (without initial connection time) 128 tps = 15825.734482 (without initial connection time) 256 tps = 15017.693905 (without initial connection time) 512 tps = 13479.973395 (without initial connection time) 1024 tps = 13200.444397 (without initial connection time) 2048 tps = 12884.645475 (without initial connection time) Comparing them to numbers when using force_generic_plan shows that making the generic plans faster is indeed worthwhile. -- Amit Langote EDB: http://www.enterprisedb.com
Hi,
w.r.t. v5-0003-Teach-AcquireExecutorLocks-to-skip-locking-pruned.patch :
(pruning steps containing expressions that can be computed before
before the executor proper has started)
the word 'before' was repeated.
For ExecInitParallelPlan():
+ char *execlockrelsinfo_data;
+ char *execlockrelsinfo_space;
the content of execlockrelsinfo_data is copied into execlockrelsinfo_space.
I wonder if having one of execlockrelsinfo_data and execlockrelsinfo_space suffices.
Cheers
On Fri, Mar 11, 2022 at 9:35 AM Amit Langote <amitlangote09@gmail.com> wrote: > Attached is v5, now broken into 3 patches: > > 0001: Some refactoring of runtime pruning code > 0002: Add a plan_tree_walker > 0003: Teach AcquireExecutorLocks to skip locking pruned relations So is any other committer planning to look at this? Tom, perhaps? David? This strikes me as important work, and I don't mind going through and trying to do some detailed review, but (A) I am not the person most familiar with the code being modified here and (B) there are some important theoretical questions about the approach that we might want to try to cover before we get down into the details. In my opinion, the most important theoretical issue here is around reuse of plans that are no longer entirely valid, but the parts that are no longer valid are certain to be pruned. If, because we know that some parameter has some particular value, we skip locking a bunch of partitions, then when we're executing the plan, those partitions need not exist any more -- or they could have different indexes, be detached from the partitioning hierarchy and subsequently altered, whatever. That seems fine to me provided that all of our code (and any third-party code) is careful not to rely on the portion of the plan that we've pruned away, and doesn't assume that (for example) we can still fetch the name of an index whose OID appears in there someplace. I cannot think of a hazard where the fact that the part of a plan is no longer valid because some DDL has been executed "infects" the remainder of the plan. As long as we lock the partitioned tables named in the plan and their descendents down to the level just above the one at which something is pruned, and are careful, I think we should be OK. It would be nice to know if someone has a fundamentally different view of the hazards here, though. Just to state my position here clearly, I would be more than happy if somebody else plans to pick this up and try to get some or all of it committed, and will cheerfully defer to such person in the event that they have that plan. If, however, no such person exists, I may try my hand at that myself. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > In my opinion, the most important theoretical issue here is around > reuse of plans that are no longer entirely valid, but the parts that > are no longer valid are certain to be pruned. If, because we know that > some parameter has some particular value, we skip locking a bunch of > partitions, then when we're executing the plan, those partitions need > not exist any more -- or they could have different indexes, be > detached from the partitioning hierarchy and subsequently altered, > whatever. Check. > That seems fine to me provided that all of our code (and any > third-party code) is careful not to rely on the portion of the plan > that we've pruned away, and doesn't assume that (for example) we can > still fetch the name of an index whose OID appears in there someplace. ... like EXPLAIN, for example? If "pruning" means physical removal from the plan tree, then it's probably all right. However, it looks to me like that doesn't actually happen, or at least doesn't happen till much later, so there's room for worry about a disconnect between what plancache.c has verified and what executor startup will try to touch. As you say, in the absence of any bugs, that's not a problem ... but if there are such bugs, tracking them down would be really hard. What I am skeptical about is that this work actually accomplishes anything under real-world conditions. That's because if pruning would save enough to make skipping the lock-acquisition phase worth the trouble, the plan cache is almost certainly going to decide it should be using a custom plan not a generic plan. Now if we had a better cost model (or, indeed, any model at all) for run-time pruning effects then maybe that situation could be improved. I think we'd be better served to worry about that end of it before we spend more time making the executor even less predictable. Also, while I've not spent much time at all reading this patch, it seems rather desperately undercommented, and a lot of the new names are unintelligible. In particular, I suspect that the patch is significantly redesigning when/where run-time pruning happens (unless it's just letting that be run twice); but I don't see any documentation or name changes suggesting where that responsibility is now. regards, tom lane
On Mon, Mar 14, 2022 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > ... like EXPLAIN, for example? Exactly! I think that's the foremost example, but extension modules like auto_explain or even third-party extensions are also a risk. I think there was some discussion of this previously. > If "pruning" means physical removal from the plan tree, then it's > probably all right. However, it looks to me like that doesn't > actually happen, or at least doesn't happen till much later, so > there's room for worry about a disconnect between what plancache.c > has verified and what executor startup will try to touch. As you > say, in the absence of any bugs, that's not a problem ... but if > there are such bugs, tracking them down would be really hard. Surgery on the plan would violate the general principle that plans are read only once constructed. I think the idea ought to be to pass a secondary data structure around with the plan that defines which parts you must ignore. Any code that fails to use that other data structure in the appropriate manner gets defined to be buggy and has to be fixed by making it follow the new rules. > What I am skeptical about is that this work actually accomplishes > anything under real-world conditions. That's because if pruning would > save enough to make skipping the lock-acquisition phase worth the > trouble, the plan cache is almost certainly going to decide it should > be using a custom plan not a generic plan. Now if we had a better > cost model (or, indeed, any model at all) for run-time pruning effects > then maybe that situation could be improved. I think we'd be better > served to worry about that end of it before we spend more time making > the executor even less predictable. I don't agree with that analysis, because setting plan_cache_mode is not uncommon. Even if that GUC didn't exist, I'm pretty sure there are cases where the planner naturally falls into a generic plan anyway, even though pruning is happening. But as it is, the GUC does exist, and people use it. Consequently, while I'd love to see something done about the costing side of things, I do not accept that all other improvements should wait for that to happen. > Also, while I've not spent much time at all reading this patch, > it seems rather desperately undercommented, and a lot of the > new names are unintelligible. In particular, I suspect that the > patch is significantly redesigning when/where run-time pruning > happens (unless it's just letting that be run twice); but I don't > see any documentation or name changes suggesting where that > responsibility is now. I am sympathetic to that concern. I spent a while staring at a baffling comment in 0001 only to discover it had just been moved from elsewhere. I really don't feel that things in this are as clear as they could be -- although I hasten to add that I respect the people who have done work in this area previously and am grateful for what they did. It's been a huge benefit to the project in spite of the bumps in the road. Moreover, this isn't the only code in PostgreSQL that needs improvement, or the worst. That said, I do think there are problems. I don't yet have a position on whether this patch is making that better or worse. That said, I believe that the core idea of the patch is to optionally perform pruning before we acquire locks or spin up the main executor and then remember the decisions we made. If once the main executor is spun up we already made those decisions, then we must stick with what we decided. 
If not, we make those pruning decisions at the same point we do currently - more or less on demand, at the point when we'd need to know whether to descend that branch of the plan tree or not. I think this scheme comes about because there are a couple of different interfaces to the parameterized query stuff, and in some code paths we have the values early enough to use them for pre-pruning, and in others we don't. -- Robert Haas EDB: http://www.enterprisedb.com
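One possible shape for that secondary structure, purely for illustration (none of these type or field names exist in the tree): the plan itself stays read-only, and a side structure built when the pre-execution pruning is done records which parts of it are still live.

    typedef struct NodePruneResult
    {
        NodeTag     type;
        int         plan_node_id;       /* the Append/MergeAppend this is for */
        Bitmapset  *valid_subplans;     /* indexes into its subplan list */
    } NodePruneResult;

    typedef struct PlanPruneResults
    {
        NodeTag     type;
        Bitmapset  *lockedRelids;       /* RT indexes that were actually locked */
        List       *node_results;       /* one NodePruneResult per pruned node */
    } PlanPruneResults;

EXPLAIN, auto_explain, and any other plan-walking code would then be expected to consult this structure before touching a subnode, which is the "new rule" being described.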
On Tue, Mar 15, 2022 at 5:06 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 14, 2022 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > What I am skeptical about is that this work actually accomplishes > > anything under real-world conditions. That's because if pruning would > > save enough to make skipping the lock-acquisition phase worth the > > trouble, the plan cache is almost certainly going to decide it should > > be using a custom plan not a generic plan. Now if we had a better > > cost model (or, indeed, any model at all) for run-time pruning effects > > then maybe that situation could be improved. I think we'd be better > > served to worry about that end of it before we spend more time making > > the executor even less predictable. > > I don't agree with that analysis, because setting plan_cache_mode is > not uncommon. Even if that GUC didn't exist, I'm pretty sure there are > cases where the planner naturally falls into a generic plan anyway, > even though pruning is happening. But as it is, the GUC does exist, > and people use it. Consequently, while I'd love to see something done > about the costing side of things, I do not accept that all other > improvements should wait for that to happen. I agree that making generic plans execute faster has merit even before we make the costing changes to allow plancache.c prefer generic plans over custom ones in these cases. As the numbers in my previous email show, simply executing a generic plan with the proposed improvements applied is significantly cheaper than having the planner do the pruning on every execution: nparts auto/custom generic ====== ========== ====== 32 13359 28204 64 15760 26795 128 15825 26387 256 15017 25601 512 13479 19911 1024 13200 20158 2048 12884 16180 > > Also, while I've not spent much time at all reading this patch, > > it seems rather desperately undercommented, and a lot of the > > new names are unintelligible. In particular, I suspect that the > > patch is significantly redesigning when/where run-time pruning > > happens (unless it's just letting that be run twice); but I don't > > see any documentation or name changes suggesting where that > > responsibility is now. > > I am sympathetic to that concern. I spent a while staring at a > baffling comment in 0001 only to discover it had just been moved from > elsewhere. I really don't feel that things in this are as clear as > they could be -- although I hasten to add that I respect the people > who have done work in this area previously and am grateful for what > they did. It's been a huge benefit to the project in spite of the > bumps in the road. Moreover, this isn't the only code in PostgreSQL > that needs improvement, or the worst. That said, I do think there are > problems. I don't yet have a position on whether this patch is making > that better or worse. Okay, I'd like to post a new version with the comments edited to make them a bit more intelligible. I understand that the comments around the new invocation mode(s) of runtime pruning are not as clear as they should be, especially as the changes that this patch wants to make to how things work are not very localized. > That said, I believe that the core idea of the patch is to optionally > perform pruning before we acquire locks or spin up the main executor > and then remember the decisions we made. If once the main executor is > spun up we already made those decisions, then we must stick with what > we decided. 
If not, we make those pruning decisions at the same point > we do currently Right. The "initial" pruning, that this patch wants to make occur at an earlier point (plancache.c), is currently performed in ExecInit[Merge]Append(). If it does occur early due to the plan being a cached one, ExecInit[Merge]Append() simply refers to its result that would be made available via a new data structure that plancache.c has been made to pass down to the executor alongside the plan tree. If it does not, ExecInit[Merge]Append() does the pruning in the same way it does now. Such cases include initial pruning using only STABLE expressions that the planner doesn't bother to compute by itself lest the resulting plan may be cached, but no EXTERN parameters. -- Amit Langote EDB: http://www.enterprisedb.com
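In other words, the intended control flow in ExecInitAppend() would look roughly like this; es_execprep is the patch's EState field, while ExecPrepFindValidSubPlans() is a made-up name for looking up the stashed result.

    /* Sketch of the intended flow in ExecInitAppend() */
    if (node->part_prune_info != NULL)
    {
        Bitmapset  *validsubplans;

        if (estate->es_execprep != NULL)
        {
            /*
             * Generic cached plan: plancache.c already ran the initial
             * pruning steps, so just look up the recorded result for this
             * plan node.
             */
            validsubplans = ExecPrepFindValidSubPlans(estate->es_execprep,
                                                      node->plan.plan_node_id);
        }
        else
        {
            /* Custom / uncached plan: do the initial pruning here, as today. */
            PartitionPruneState *prunestate;

            prunestate = ExecCreatePartitionPruneState(&appendstate->ps,
                                                       node->part_prune_info);
            appendstate->as_prune_state = prunestate;
            validsubplans = ExecFindInitialMatchingSubPlans(prunestate,
                                                            list_length(node->appendplans));
        }

        /* ... initialize only the subplans named in validsubplans ... */
    }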
On Tue, Mar 15, 2022 at 3:19 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Tue, Mar 15, 2022 at 5:06 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Mar 14, 2022 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > Also, while I've not spent much time at all reading this patch, > > > it seems rather desperately undercommented, and a lot of the > > > new names are unintelligible. In particular, I suspect that the > > > patch is significantly redesigning when/where run-time pruning > > > happens (unless it's just letting that be run twice); but I don't > > > see any documentation or name changes suggesting where that > > > responsibility is now. > > > > I am sympathetic to that concern. I spent a while staring at a > > baffling comment in 0001 only to discover it had just been moved from > > elsewhere. I really don't feel that things in this are as clear as > > they could be -- although I hasten to add that I respect the people > > who have done work in this area previously and am grateful for what > > they did. It's been a huge benefit to the project in spite of the > > bumps in the road. Moreover, this isn't the only code in PostgreSQL > > that needs improvement, or the worst. That said, I do think there are > > problems. I don't yet have a position on whether this patch is making > > that better or worse. > > Okay, I'd like to post a new version with the comments edited to make > them a bit more intelligible. I understand that the comments around > the new invocation mode(s) of runtime pruning are not as clear as they > should be, especially as the changes that this patch wants to make to > how things work are not very localized. Actually, another area where the comments may not be as clear as they should have been is the changes that the patch makes to the AcquireExecutorLocks() logic that decides which relations are locked to safeguard the plan tree for execution, which are those given by RTE_RELATION entries in the range table. Without the patch, they are found by actually scanning the range table. With the patch, it's the same set of RTEs if the plan doesn't contain any pruning nodes, though instead of the range table, what is scanned is a bitmapset of their RT indexes that is made available by the planner in the form of PlannedStmt.lockrels. When the plan does contain a pruning node (PlannedStmt.containsInitialPruning), the bitmapset is constructed by calling ExecutorGetLockRels() on the plan tree, which walks it to add RT indexes of relations mentioned in the Scan nodes, while skipping any nodes that are pruned after performing initial pruning steps that may be present in their containing parent node's PartitionPruneInfo. Also, the RT indexes of partitioned tables that are present in the PartitionPruneInfo itself are also added to the set. While expanding comments added by the patch to make this clear, I realized that there are two problems, one of them quite glaring: * Planner's constructing this bitmapset and its copying along with the PlannedStmt is pure overhead in the cases that this patch has nothing to do with, which is the kind of thing that Andres cautioned against upthread. * Not all partitioned tables that would have been locked without the patch to come up with a Append/MergeAppend plan may be returned by ExecutorGetLockRels(). For example, if none of the query's runtime-prunable quals were found to match the partition key of an intermediate partitioned table and thus that partitioned table not included in the PartitionPruneInfo. 
Or if an Append/MergeAppend covering a partition tree doesn't contain any PartitionPruneInfo to begin with, in which case, only the leaf partitions and none of partitioned parents would be accounted for by the ExecutorGetLockRels() logic. The 1st one seems easy to fix by not inventing PlannedStmt.lockrels and just doing what's being done now: scan the range table if (!PlannedStmt.containsInitialPruning). The only way perhaps to fix the second one is to reconsider the decision we made in the following commit: commit 52ed730d511b7b1147f2851a7295ef1fb5273776 Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Sun Oct 7 14:33:17 2018 -0400 Remove some unnecessary fields from Plan trees. In the wake of commit f2343653f, we no longer need some fields that were used before to control executor lock acquisitions: * PlannedStmt.nonleafResultRelations can go away entirely. * partitioned_rels can go away from Append, MergeAppend, and ModifyTable. However, ModifyTable still needs to know the RT index of the partition root table if any, which was formerly kept in the first entry of that list. Add a new field "rootRelation" to remember that. rootRelation is partly redundant with nominalRelation, in that if it's set it will have the same value as nominalRelation. However, the latter field has a different purpose so it seems best to keep them distinct. That is, add back the partitioned_rels field, at least to Append and MergeAppend, to store the RT indexes of partitioned tables whose children's paths are present in Append/MergeAppend.subpaths. Thoughts? -- Amit Langote EDB: http://www.enterprisedb.com
On Tue, Mar 22, 2022 at 9:44 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Tue, Mar 15, 2022 at 3:19 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Tue, Mar 15, 2022 at 5:06 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > On Mon, Mar 14, 2022 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Also, while I've not spent much time at all reading this patch, > > > > it seems rather desperately undercommented, and a lot of the > > > > new names are unintelligible. In particular, I suspect that the > > > > patch is significantly redesigning when/where run-time pruning > > > > happens (unless it's just letting that be run twice); but I don't > > > > see any documentation or name changes suggesting where that > > > > responsibility is now. > > > > > > I am sympathetic to that concern. I spent a while staring at a > > > baffling comment in 0001 only to discover it had just been moved from > > > elsewhere. I really don't feel that things in this are as clear as > > > they could be -- although I hasten to add that I respect the people > > > who have done work in this area previously and am grateful for what > > > they did. It's been a huge benefit to the project in spite of the > > > bumps in the road. Moreover, this isn't the only code in PostgreSQL > > > that needs improvement, or the worst. That said, I do think there are > > > problems. I don't yet have a position on whether this patch is making > > > that better or worse. > > > > Okay, I'd like to post a new version with the comments edited to make > > them a bit more intelligible. I understand that the comments around > > the new invocation mode(s) of runtime pruning are not as clear as they > > should be, especially as the changes that this patch wants to make to > > how things work are not very localized. > > Actually, another area where the comments may not be as clear as they > should have been is the changes that the patch makes to the > AcquireExecutorLocks() logic that decides which relations are locked > to safeguard the plan tree for execution, which are those given by > RTE_RELATION entries in the range table. > > Without the patch, they are found by actually scanning the range table. > > With the patch, it's the same set of RTEs if the plan doesn't contain > any pruning nodes, though instead of the range table, what is scanned > is a bitmapset of their RT indexes that is made available by the > planner in the form of PlannedStmt.lockrels. When the plan does > contain a pruning node (PlannedStmt.containsInitialPruning), the > bitmapset is constructed by calling ExecutorGetLockRels() on the plan > tree, which walks it to add RT indexes of relations mentioned in the > Scan nodes, while skipping any nodes that are pruned after performing > initial pruning steps that may be present in their containing parent > node's PartitionPruneInfo. Also, the RT indexes of partitioned tables > that are present in the PartitionPruneInfo itself are also added to > the set. > > While expanding comments added by the patch to make this clear, I > realized that there are two problems, one of them quite glaring: > > * Planner's constructing this bitmapset and its copying along with the > PlannedStmt is pure overhead in the cases that this patch has nothing > to do with, which is the kind of thing that Andres cautioned against > upthread. > > * Not all partitioned tables that would have been locked without the > patch to come up with a Append/MergeAppend plan may be returned by > ExecutorGetLockRels(). 
For example, if none of the query's > runtime-prunable quals were found to match the partition key of an > intermediate partitioned table and thus that partitioned table not > included in the PartitionPruneInfo. Or if an Append/MergeAppend > covering a partition tree doesn't contain any PartitionPruneInfo to > begin with, in which case, only the leaf partitions and none of > partitioned parents would be accounted for by the > ExecutorGetLockRels() logic. > > The 1st one seems easy to fix by not inventing PlannedStmt.lockrels > and just doing what's being done now: scan the range table if > (!PlannedStmt.containsInitialPruning). The attached updated patch does it like this. > The only way perhaps to fix the second one is to reconsider the > decision we made in the following commit: > > commit 52ed730d511b7b1147f2851a7295ef1fb5273776 > Author: Tom Lane <tgl@sss.pgh.pa.us> > Date: Sun Oct 7 14:33:17 2018 -0400 > > Remove some unnecessary fields from Plan trees. > > In the wake of commit f2343653f, we no longer need some fields that > were used before to control executor lock acquisitions: > > * PlannedStmt.nonleafResultRelations can go away entirely. > > * partitioned_rels can go away from Append, MergeAppend, and ModifyTable. > However, ModifyTable still needs to know the RT index of the partition > root table if any, which was formerly kept in the first entry of that > list. Add a new field "rootRelation" to remember that. rootRelation is > partly redundant with nominalRelation, in that if it's set it will have > the same value as nominalRelation. However, the latter field has a > different purpose so it seems best to keep them distinct. > > That is, add back the partitioned_rels field, at least to Append and > MergeAppend, to store the RT indexes of partitioned tables whose > children's paths are present in Append/MergeAppend.subpaths. And implemented this in the attached 0002 that reintroduces partitioned_rels in Append/MergeAppend nodes as a bitmapset of RT indexes. The set contains the RT indexes of partitioned ancestors whose expansion produced the leaf partitions that a given Append/MergeAppend node scans. This project needs this way of knowing the partitioned tables involved in producing an Append/MergeAppend node, because we'd like to give plancache.c the ability to glean the set of relations to be locked by scanning a plan tree to make the tree ready for execution rather than by scanning the range table and the only relations we're missing in the tree right now are partitioned tables. One fly-in-the-ointment situation I faced when doing that is the fact that setrefs.c in most situations removes the Append/MergeAppend from the final plan if it contains only one child subplan. I got around it by inventing a PlannerGlobal/PlannedStmt.elidedAppendPartedRels set which is a union of partitioned_rels of all the Append/MergeAppend nodes in the plan tree that were removed as described. Other than the changes mentioned above, the updated patch now contains a bit more commentary than earlier versions, mostly around AcquireExecutorLocks()'s new way of determining the set of relations to lock and the significantly redesigned working of the "initial" execution pruning. -- Amit Langote EDB: http://www.enterprisedb.com
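To make the intended use of the revived field concrete, the treatment of a single Append inside the lock-collection walk could look roughly like this; partitioned_rels is the field proposed above, get_valid_subplans_after_init_pruning() is a placeholder for running the node's initial pruning steps, and the cast to Scan assumes simple leaf scans for brevity.

    static void
    collect_append_lock_rels(Append *aplan, PlannedStmt *stmt,
                             Bitmapset **lockrels)
    {
        Bitmapset  *validsubplans;
        int         i = -1;

        /* partitioned ancestors are not scanned, but must still be locked */
        *lockrels = bms_add_members(*lockrels, aplan->partitioned_rels);

        /* run the initial pruning steps, if any, to find surviving children */
        validsubplans = get_valid_subplans_after_init_pruning(aplan, stmt);

        while ((i = bms_next_member(validsubplans, i)) >= 0)
        {
            Plan   *subplan = (Plan *) list_nth(aplan->appendplans, i);

            /* assuming simple scans here for brevity */
            *lockrels = bms_add_member(*lockrels, ((Scan *) subplan)->scanrelid);
        }
    }

The RT indexes stashed in the proposed elidedAppendPartedRels field would be added to the set separately, since the Append nodes they came from are no longer in the tree.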
On Mon, Mar 28, 2022 at 4:17 PM Amit Langote <amitlangote09@gmail.com> wrote: > Other than the changes mentioned above, the updated patch now contains > a bit more commentary than earlier versions, mostly around > AcquireExecutorLocks()'s new way of determining the set of relations > to lock and the significantly redesigned working of the "initial" > execution pruning. Forgot to rebase over the latest HEAD, so here's v7. Also fixed that _out and _read functions for PlanInitPruningOutput were using an obsolete node label. -- Amit Langote EDB: http://www.enterprisedb.com
On Mon, Mar 28, 2022 at 4:28 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Mon, Mar 28, 2022 at 4:17 PM Amit Langote <amitlangote09@gmail.com> wrote: > > Other than the changes mentioned above, the updated patch now contains > > a bit more commentary than earlier versions, mostly around > > AcquireExecutorLocks()'s new way of determining the set of relations > > to lock and the significantly redesigned working of the "initial" > > execution pruning. > > Forgot to rebase over the latest HEAD, so here's v7. Also fixed that > _out and _read functions for PlanInitPruningOutput were using an > obsolete node label. Rebased. -- Amit Langote EDB: http://www.enterprisedb.com
I'm looking at 0001 here with intention to commit later. I see that there is some resistance to 0004, but I think a final verdict on that one doesn't materially affect 0001. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El destino baraja y nosotros jugamos" (A. Schopenhauer)
On Thu, Mar 31, 2022 at 6:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I'm looking at 0001 here with intention to commit later. I see that > there is some resistance to 0004, but I think a final verdict on that > one doesn't materially affect 0001. Thanks. While the main goal of the refactoring patch is to make it easier to review the more complex changes that 0004 makes to execPartition.c, I agree it has merit on its own. Although, one may say that the bit about providing a PlanState-independent ExprContext is more closely tied with 0004's requirements... -- Amit Langote EDB: http://www.enterprisedb.com
On Thu, 31 Mar 2022 at 16:25, Amit Langote <amitlangote09@gmail.com> wrote: > Rebased. I've been looking over the v8 patch and I'd like to propose semi-baked ideas to improve things. I'd need to go and write them myself to fully know if they'd actually work ok. 1. You've changed the signature of various functions by adding ExecLockRelsInfo *execlockrelsinfo. I'm wondering why you didn't just put the ExecLockRelsInfo as a new field in PlannedStmt? I think the above gets around messing with the signatures of CreateQueryDesc(), ExplainOnePlan(), pg_plan_queries(), PortalDefineQuery(), ProcessQuery(). It would get rid of your change of foreach to forboth in execute_sql_string() / PortalRunMulti() and gets rid of a number of places where you're carrying around a variable named execlockrelsinfo_list. It would also make the patch significantly easier to review as you'd be touching far fewer files. 2. I don't really like the way you've gone about most of the patch... The way I imagine this working is that during create_plan() we visit all nodes that have run-time pruning, then inside create_append_plan() and create_merge_append_plan() we'd tag those onto a new field in PlannerGlobal. That way you can store the PartitionPruneInfos in the new PlannedStmt field in standard_planner() after the makeNode(PlannedStmt). Instead of storing the PartitionPruneInfo in the Append / MergeAppend struct, you'd just add a new index field to those structs. The index would start with 0 for the 0th PartitionPruneInfo. You'd basically just know the index by assigning list_length(root->glob->partitionpruneinfos). You'd then assign the root->glob->partitionpruneinfos to PlannedStmt.partitionpruneinfos and anytime you needed to do run-time pruning during execution, you'd need to use the Append / MergeAppend's partition_prune_info_idx to look up the PartitionPruneInfo in some new field you add to EState to store those. You'd leave that index as -1 if there's no PartitionPruneInfo for the Append / MergeAppend node. When you do AcquireExecutorLocks(), you'd iterate over the PlannedStmt's PartitionPruneInfos to figure out which subplans to prune. You'd then have an array sized list_length(plannedstmt->runtimepruneinfos) where you'd store the result. When the Append/MergeAppend node starts up you just check if the part_prune_info_idx >= 0 and if there's a non-NULL result stored then use that result. That's how you'd ensure you always got the same run-time prune result between locking and plan startup. 3. Also, looking at ExecGetLockRels(), shouldn't it be the planner's job to determine the minimum set of relations which must be locked? I think the plan tree traversal during execution is not great. Seems the whole point of this patch is to reduce overhead during execution. A full additional plan traversal aside from the 3 that we already do for start/run/end of execution seems not great. I think this means that during AcquireExecutorLocks() you'd start with the minimum set of RTEs that need to be locked as determined during create_plan() and stored in some Bitmapset field in PlannedStmt. This minimal set would also only exclude RTIs that would only possibly be used due to a PartitionPruneInfo with initial pruning steps, i.e. include RTIs from PartitionPruneInfo with no init pruning steps (you can't skip any locks for those). All you need to do to determine the RTEs to lock is to take the minimal set and execute each PartitionPruneInfo in the PlannedStmt that has init steps.
4. It's a bit disappointing to see RelOptInfo.partitioned_rels getting revived here. Why don't you just add a partitioned_relids to PartitionPruneInfo and just have make_partitionedrel_pruneinfo build you a Relids of them? PartitionedRelPruneInfo already has an rtindex field, so you just need to bms_add_member whatever that rtindex is. It's a fairly high-level review at this stage. I can look in more detail if the above points get looked at. You may find or know of some reason why it can't be done like I mention above. David
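A sketch of the data-structure changes proposed in points 1 and 2 above; every name here is either from David's description or invented for the sketch (es_part_prune_results in particular), so none of it is existing code: PlannedStmt gets a List of PartitionPruneInfos plus a minimal-lock Bitmapset, Append/MergeAppend carry only an index into that List, and EState caches one pruning result per PartitionPruneInfo.

    static Bitmapset *
    get_cached_initial_prune_result(EState *estate, int part_prune_info_idx)
    {
        /* -1 means the Append/MergeAppend has no run-time pruning at all */
        if (part_prune_info_idx < 0)
            return NULL;

        /* es_part_prune_results: hypothetical array sized by the PlannedStmt list */
        return estate->es_part_prune_results[part_prune_info_idx];
    }

ExecInitAppend() would then use the cached result when it is non-NULL and fall back to running the pruning steps itself otherwise, which is what guarantees the same answer at lock time and at plan startup.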
Thanks a lot for looking into this. On Fri, Apr 1, 2022 at 10:32 AM David Rowley <dgrowleyml@gmail.com> wrote: > I've been looking over the v8 patch and I'd like to propose semi-baked > ideas to improve things. I'd need to go and write them myself to > fully know if they'd actually work ok. > > 1. You've changed the signature of various functions by adding > ExecLockRelsInfo *execlockrelsinfo. I'm wondering why you didn't just > put the ExecLockRelsInfo as a new field in PlannedStmt? > > I think the above gets around messing the signatures of > CreateQueryDesc(), ExplainOnePlan(), pg_plan_queries(), > PortalDefineQuery(), ProcessQuery() It would get rid of your change of > foreach to forboth in execute_sql_string() / PortalRunMulti() and gets > rid of a number of places where your carrying around a variable named > execlockrelsinfo_list. It would also make the patch significantly > easier to review as you'd be touching far fewer files. I'm worried about that churn myself and did consider this idea, though I couldn't shake the feeling that it's maybe wrong to put something in PlannedStmt that the planner itself doesn't produce. I mean the definition of PlannedStmt says this: /* ---------------- * PlannedStmt node * * The output of the planner With the ideas that you've outlined below, perhaps we can frame most of the things that the patch wants to do as the planner and the plancache changes. If we twist the above definition a bit to say what the plancache does in this regard is part of planning, maybe it makes sense to add the initial pruning related fields (nodes, outputs) into PlannedStmt. > 2. I don't really like the way you've gone about most of the patch... > > The way I imagine this working is that during create_plan() we visit > all nodes that have run-time pruning then inside create_append_plan() > and create_merge_append_plan() we'd tag those onto a new field in > PlannerGlobal That way you can store the PartitionPruneInfos in the > new PlannedStmt field in standard_planner() after the > makeNode(PlannedStmt). > > Instead of storing the PartitionPruneInfo in the Append / MergeAppend > struct, you'd just add a new index field to those structs. The index > would start with 0 for the 0th PartitionPruneInfo. You'd basically > just know the index by assigning > list_length(root->glob->partitionpruneinfos). > > You'd then assign the root->glob->partitionpruneinfos to > PlannedStmt.partitionpruneinfos and anytime you needed to do run-time > pruning during execution, you'd need to use the Append / MergeAppend's > partition_prune_info_idx to lookup the PartitionPruneInfo in some new > field you add to EState to store those. You'd leave that index as -1 > if there's no PartitionPruneInfo for the Append / MergeAppend node. > > When you do AcquireExecutorLocks(), you'd iterate over the > PlannedStmt's PartitionPruneInfo to figure out which subplans to > prune. You'd then have an array sized > list_length(plannedstmt->runtimepruneinfos) where you'd store the > result. When the Append/MergeAppend node starts up you just check if > the part_prune_info_idx >= 0 and if there's a non-NULL result stored > then use that result. That's how you'd ensure you always got the same > run-time prune result between locking and plan startup. Actually, Robert too suggested such an idea to me off-list and I think it's worth trying. 
I was not sure about the implementation, because then we'd be passing around lists of initial pruning nodes/results across many function/module boundaries that you mentioned in your comment 1, but if we agree that PlannedStmt is an acceptable place for those things to be stored, then I agree it's an attractive idea. > 3. Also, looking at ExecGetLockRels(), shouldn't it be the planner's > job to determine the minimum set of relations which must be locked? I > think the plan tree traversal during execution not great. Seems the > whole point of this patch is to reduce overhead during execution. A > full additional plan traversal aside from the 3 that we already do for > start/run/end of execution seems not great. > > I think this means that during AcquireExecutorLocks() you'd start with > the minimum set or RTEs that need to be locked as determined during > create_plan() and stored in some Bitmapset field in PlannedStmt. The patch did have a PlannedStmt.lockrels till v6. Though, it wasn't the same thing as you are describing it... > This > minimal set would also only exclude RTIs that would only possibly be > used due to a PartitionPruneInfo with initial pruning steps, i.e. > include RTIs from PartitionPruneInfo with no init pruining steps (you > can't skip any locks for those). All you need to do to determine the > RTEs to lock are to take the minimal set and execute each > PartitionPruneInfo in the PlannedStmt that has init steps So just thinking about an Append/MergeAppend, the minimum set must include the RT indexes of all the partitioned tables whose direct and indirect children's plans will be in 'subplans' and also of the children if the PartitionPruneInfo doesn't contain initial steps or if there is no PartitionPruneInfo to begin with. One question is whether the planner should always pay the overhead of initializing this bitmapset? I mean it's only worthwhile if AcquireExecutorLocks() is going to be involved, that is, the plan will be cached and reused. > 4. It's a bit disappointing to see RelOptInfo.partitioned_rels getting > revived here. Why don't you just add a partitioned_relids to > PartitionPruneInfo and just have make_partitionedrel_pruneinfo build > you a Relids of them. PartitionedRelPruneInfo already has an rtindex > field, so you just need to bms_add_member whatever that rtindex is. Hmm, not all Append/MergeAppend nodes in the plan tree may have make_partition_pruneinfo() called on them though. If not the proposed RelOptInfo.partitioned_rels that is populated in the early planning stages, the only reliable way to get all the partitioned tables involved in Appends/MergeAppends at create_plan() stage seems to be to make a function out the stanza at the top of make_partition_pruneinfo() that collects them by scanning the leaf paths and tracing each path's relation's parents up to the root partitioned parent and call it from create_{merge_}append_plan() if make_partition_pruneinfo() was not. I did try to implement that and found it a bit complex and expensive (the scanning the leaf paths part). > It's a fairly high-level review at this stage. I can look in more > detail if the above points get looked at. You may find or know of > some reason why it can't be done like I mention above. I'll try to write a version with the above points addressed, while keeping RelOptInfo.partitioned_rels around for now. -- Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/CA%2BHiwqH9-fAvpG-w9qYCcDWzK3vGPCMyw4f9nHzqkxXVuD1pxw%40mail.gmail.com
Amit Langote <amitlangote09@gmail.com> writes: > On Fri, Apr 1, 2022 at 10:32 AM David Rowley <dgrowleyml@gmail.com> wrote: >> 1. You've changed the signature of various functions by adding >> ExecLockRelsInfo *execlockrelsinfo. I'm wondering why you didn't just >> put the ExecLockRelsInfo as a new field in PlannedStmt? > I'm worried about that churn myself and did consider this idea, though > I couldn't shake the feeling that it's maybe wrong to put something in > PlannedStmt that the planner itself doesn't produce. PlannedStmt is part of the plan tree, which MUST be read-only to the executor. This is not negotiable. However, there's other places that this data could be put, such as QueryDesc. Or for that matter, couldn't the data structure be created by the planner? (It looks like David is proposing exactly that further down.) regards, tom lane
On Fri, 1 Apr 2022 at 16:09, Amit Langote <amitlangote09@gmail.com> wrote: > definition of PlannedStmt says this: > > /* ---------------- > * PlannedStmt node > * > * The output of the planner > > With the ideas that you've outlined below, perhaps we can frame most > of the things that the patch wants to do as the planner and the > plancache changes. If we twist the above definition a bit to say what > the plancache does in this regard is part of planning, maybe it makes > sense to add the initial pruning related fields (nodes, outputs) into > PlannedStmt. How about the PartitionPruneInfos go into PlannedStmt as a List indexed in the way I mentioned and the cache of the results of pruning in EState? I think that leaves you adding List *partpruneinfos, Bitmapset *minimumlockrtis to PlannedStmt and the thing you have to cache the pruning results into EState. I'm not very clear on where you should stash the results of run-time pruning in the meantime before you can put them in EState. You might need to invent some intermediate struct that gets passed around that you can scribble down some details you're going to need during execution. > One question is whether the planner should always pay the overhead of > initializing this bitmapset? I mean it's only worthwhile if > AcquireExecutorLocks() is going to be involved, that is, the plan will > be cached and reused. Maybe the Bitmapset for the minimal locks needs to be built with bms_add_range(NULL, 0, list_length(rtable)); then do bms_del_members() on the relevant RTIs you find in the listed PartitionPruneInfos. That way it's very simple and cheap to do when there are no PartitionPruneInfos. > > 4. It's a bit disappointing to see RelOptInfo.partitioned_rels getting > > revived here. Why don't you just add a partitioned_relids to > > PartitionPruneInfo and just have make_partitionedrel_pruneinfo build > > you a Relids of them. PartitionedRelPruneInfo already has an rtindex > > field, so you just need to bms_add_member whatever that rtindex is. > > Hmm, not all Append/MergeAppend nodes in the plan tree may have > make_partition_pruneinfo() called on them though. For Append/MergeAppends without run-time pruning you'll want to add the RTIs to the minimal locking set of RTIs to go into PlannedStmt. The only things you want to leave out of that are RTIs for the RTEs that you might run-time prune away during AcquireExecutorLocks(). David
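To make the bitmapset suggestion concrete, here is a minimal sketch of how the minimal lock set might be built at the end of planning (starting from 1 since RT indexes are 1-based); needs_init_pruning and prunable_leaf_relids are hypothetical fields used only for illustration:

    ListCell   *lc;

    /* Start by assuming every range table entry must be locked. */
    glob->minLockRelids = bms_add_range(NULL, 1, list_length(glob->finalrtable));

    /* Knock out only the leaf partitions that initial pruning might skip. */
    foreach(lc, glob->partPruneInfos)
    {
        PartitionPruneInfo *pruneinfo = lfirst_node(PartitionPruneInfo, lc);

        /* No initial pruning steps -> no locks can be skipped for this node. */
        if (!pruneinfo->needs_init_pruning)     /* hypothetical flag */
            continue;

        /* prunable_leaf_relids: hypothetically collected by make_partition_pruneinfo() */
        glob->minLockRelids = bms_del_members(glob->minLockRelids,
                                              pruneinfo->prunable_leaf_relids);
    }

Built this way, a plan with no initial pruning steps pays little more than the single bms_add_range() call, which is what keeps the common no-pruning case cheap.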
On Fri, Apr 1, 2022 at 1:08 PM David Rowley <dgrowleyml@gmail.com> wrote: > On Fri, 1 Apr 2022 at 16:09, Amit Langote <amitlangote09@gmail.com> wrote: > > definition of PlannedStmt says this: > > > > /* ---------------- > > * PlannedStmt node > > * > > * The output of the planner > > > > With the ideas that you've outlined below, perhaps we can frame most > > of the things that the patch wants to do as the planner and the > > plancache changes. If we twist the above definition a bit to say what > > the plancache does in this regard is part of planning, maybe it makes > > sense to add the initial pruning related fields (nodes, outputs) into > > PlannedStmt. > > How about the PartitionPruneInfos go into PlannedStmt as a List > indexed in the way I mentioned and the cache of the results of pruning > in EState? > > I think that leaves you adding List *partpruneinfos, Bitmapset > *minimumlockrtis to PlannedStmt and the thing you have to cache the > pruning results into EState. I'm not very clear on where you should > stash the results of run-time pruning in the meantime before you can > put them in EState. You might need to invent some intermediate struct > that gets passed around that you can scribble down some details you're > going to need during execution. Yes, the ExecLockRelsInfo node in the current patch, that first gets added to the QueryDesc and subsequently to the EState of the query, serves as that stashing place. Not sure if you've looked at ExecLockRelInfo in detail in your review of the patch so far, but it carries the initial pruning result in what are called PlanInitPruningOutput nodes, which are stored in a list in ExecLockRelsInfo and their offsets in the list are in turn stored in an adjacent array that contains an element for every plan node in the tree. If we go with a PlannedStmt.partpruneinfos list, then maybe we don't need to have that array, because the Append/MergeAppend nodes would be carrying those offsets by themselves. Maybe a different name for ExecLockRelsInfo would be better? Also, given Tom's apparent dislike for carrying that in PlannedStmt, maybe the way I have it now is fine? > > One question is whether the planner should always pay the overhead of > > initializing this bitmapset? I mean it's only worthwhile if > > AcquireExecutorLocks() is going to be involved, that is, the plan will > > be cached and reused. > > Maybe the Bitmapset for the minimal locks needs to be built with > bms_add_range(NULL, 0, list_length(rtable)); then do > bms_del_members() on the relevant RTIs you find in the listed > PartitionPruneInfos. That way it's very simple and cheap to do when > there are no PartitionPruneInfos. Ah, okay. Looking at make_partition_pruneinfo(), I think I see a way to delete the RTIs of prunable relations -- construct a all_matched_leaf_part_relids in parallel to allmatchedsubplans and delete those from the initial set. > > > 4. It's a bit disappointing to see RelOptInfo.partitioned_rels getting > > > revived here. Why don't you just add a partitioned_relids to > > > PartitionPruneInfo and just have make_partitionedrel_pruneinfo build > > > you a Relids of them. PartitionedRelPruneInfo already has an rtindex > > > field, so you just need to bms_add_member whatever that rtindex is. > > > > Hmm, not all Append/MergeAppend nodes in the plan tree may have > > make_partition_pruneinfo() called on them though. > > For Append/MergeAppends without run-time pruning you'll want to add > the RTIs to the minimal locking set of RTIs to go into PlannedStmt. 
> The only things you want to leave out of that are RTIs for the RTEs > that you might run-time prune away during AcquireExecutorLocks(). Yeah, I see it now. Thanks. -- Amit Langote EDB: http://www.enterprisedb.com
On Fri, Apr 1, 2022 at 12:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <amitlangote09@gmail.com> writes: > > On Fri, Apr 1, 2022 at 10:32 AM David Rowley <dgrowleyml@gmail.com> wrote: > >> 1. You've changed the signature of various functions by adding > >> ExecLockRelsInfo *execlockrelsinfo. I'm wondering why you didn't just > >> put the ExecLockRelsInfo as a new field in PlannedStmt? > > > I'm worried about that churn myself and did consider this idea, though > > I couldn't shake the feeling that it's maybe wrong to put something in > > PlannedStmt that the planner itself doesn't produce. > > PlannedStmt is part of the plan tree, which MUST be read-only to > the executor. This is not negotiable. However, there's other > places that this data could be put, such as QueryDesc. > Or for that matter, couldn't the data structure be created by > the planner? (It looks like David is proposing exactly that > further down.) The data structure in question is for storing the results of performing initial partition pruning on a generic plan, which the patch proposes to do in plancache.c -- inside the body of AcquireExecutorLocks()'s loop over PlannedStmts -- so, it's hard to see it as a product of the planner. :-( -- Amit Langote EDB: http://www.enterprisedb.com
On Fri, 1 Apr 2022 at 19:58, Amit Langote <amitlangote09@gmail.com> wrote: > Yes, the ExecLockRelsInfo node in the current patch, that first gets > added to the QueryDesc and subsequently to the EState of the query, > serves as that stashing place. Not sure if you've looked at > ExecLockRelInfo in detail in your review of the patch so far, but it > carries the initial pruning result in what are called > PlanInitPruningOutput nodes, which are stored in a list in > ExecLockRelsInfo and their offsets in the list are in turn stored in > an adjacent array that contains an element for every plan node in the > tree. If we go with a PlannedStmt.partpruneinfos list, then maybe we > don't need to have that array, because the Append/MergeAppend nodes > would be carrying those offsets by themselves. I saw it, just not in great detail. I saw that you had an array that was indexed by the plan node's ID. I thought that wouldn't be so good with large complex plans that we often get with partitioning workloads. That's why I mentioned using another index that you store in Append/MergeAppend that starts at 0 and increments by 1 for each node that has a PartitionPruneInfo made for it during create_plan. > Maybe a different name for ExecLockRelsInfo would be better? > > Also, given Tom's apparent dislike for carrying that in PlannedStmt, > maybe the way I have it now is fine? I think if you change how it's indexed and the other stuff then we can have another look. I think the patch will be much easier to review once the ParitionPruneInfos are moved into PlannedStmt. David
On Fri, Apr 1, 2022 at 5:20 PM David Rowley <dgrowleyml@gmail.com> wrote: > On Fri, 1 Apr 2022 at 19:58, Amit Langote <amitlangote09@gmail.com> wrote: > > Yes, the ExecLockRelsInfo node in the current patch, that first gets > > added to the QueryDesc and subsequently to the EState of the query, > > serves as that stashing place. Not sure if you've looked at > > ExecLockRelInfo in detail in your review of the patch so far, but it > > carries the initial pruning result in what are called > > PlanInitPruningOutput nodes, which are stored in a list in > > ExecLockRelsInfo and their offsets in the list are in turn stored in > > an adjacent array that contains an element for every plan node in the > > tree. If we go with a PlannedStmt.partpruneinfos list, then maybe we > > don't need to have that array, because the Append/MergeAppend nodes > > would be carrying those offsets by themselves. > > I saw it, just not in great detail. I saw that you had an array that > was indexed by the plan node's ID. I thought that wouldn't be so good > with large complex plans that we often get with partitioning > workloads. That's why I mentioned using another index that you store > in Append/MergeAppend that starts at 0 and increments by 1 for each > node that has a PartitionPruneInfo made for it during create_plan. > > > Maybe a different name for ExecLockRelsInfo would be better? > > > > Also, given Tom's apparent dislike for carrying that in PlannedStmt, > > maybe the way I have it now is fine? > > I think if you change how it's indexed and the other stuff then we can > have another look. I think the patch will be much easier to review > once the ParitionPruneInfos are moved into PlannedStmt. Will do, thanks. -- Amit Langote EDB: http://www.enterprisedb.com
I noticed a definitional problem in 0001 that's also a bug in some conditions -- namely that the bitmapset "validplans" is never explicitly initialized to NIL. In the original coding, the BMS was always returned from somewhere; in the new code, it is passed from an uninitialized stack variable into the new ExecInitPartitionPruning function, which then proceeds to add new members to it without initializing it first. Indeed that function's header comment explicitly indicates that it is not initialized: + * Initial pruning can be done immediately, so it is done here if needed and + * the set of surviving partition subplans' indexes are added to the output + * parameter *initially_valid_subplans. even though this is not fully correct, because when prunestate->do_initial_prune is false, then the BMS *is* initialized. I have no opinion on where to initialize it, but it needs to be done somewhere and the comment needs to agree. I think the names ExecCreatePartitionPruneState and ExecInitPartitionPruning are too confusingly similar. Maybe the former should be renamed to somehow make it clear that it is a subroutine for the latter. At the top of the file, there's a new comment that reads: * ExecInitPartitionPruning: * Creates the PartitionPruneState required by each of the two pruning * functions. What are "the two pruning functions"? I think here you mean "Append" and "MergeAppend". Maybe spell that out explicitly. I think this comment needs to be reworded: + * Subplans would previously be indexed 0..(n_total_subplans - 1) should be + * changed to index range 0..num(initially_valid_subplans). -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Thanks for the review. On Sun, Apr 3, 2022 at 8:33 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I noticed a definitional problem in 0001 that's also a bug in some > conditions -- namely that the bitmapset "validplans" is never explicitly > initialized to NIL. In the original coding, the BMS was always returned > from somewhere; in the new code, it is passed from an uninitialized > stack variable into the new ExecInitPartitionPruning function, which > then proceeds to add new members to it without initializing it first. Hmm, the following blocks in ExecInitPartitionPruning() define *initially_valid_subplans: /* * Perform an initial partition prune pass, if required. */ if (prunestate->do_initial_prune) { /* Determine which subplans survive initial pruning */ *initially_valid_subplans = ExecFindInitialMatchingSubPlans(prunestate); } else { /* We'll need to initialize all subplans */ Assert(n_total_subplans > 0); *initially_valid_subplans = bms_add_range(NULL, 0, n_total_subplans - 1); } AFAICS, both assign *initially_valid_subplans a value whose computation is not dependent on reading it first, so I don't see a problem. Am I missing something? > Indeed that function's header comment explicitly indicates that it is > not initialized: > > + * Initial pruning can be done immediately, so it is done here if needed and > + * the set of surviving partition subplans' indexes are added to the output > + * parameter *initially_valid_subplans. > > even though this is not fully correct, because when prunestate->do_initial_prune > is false, then the BMS *is* initialized. > > I have no opinion on where to initialize it, but it needs to be done > somewhere and the comment needs to agree. I can see that the comment is insufficient, so I've expanded it as follows: - * Initial pruning can be done immediately, so it is done here if needed and - * the set of surviving partition subplans' indexes are added to the output - * parameter *initially_valid_subplans. + * On return, *initially_valid_subplans is assigned the set of indexes of + * child subplans that must be initialized along with the parent plan node. + * Initial pruning is performed here if needed and in that case only the + * surviving subplans' indexes are added. > I think the names ExecCreatePartitionPruneState and > ExecInitPartitionPruning are too confusingly similar. Maybe the former > should be renamed to somehow make it clear that it is a subroutine for > the former. Ah, yes. I've taken out the "Exec" from the former. > At the top of the file, there's a new comment that reads: > > * ExecInitPartitionPruning: > * Creates the PartitionPruneState required by each of the two pruning > * functions. > > What are "the two pruning functions"? I think here you mean "Append" > and "MergeAppend". Maybe spell that out explicitly. Actually it meant: ExecFindInitiaMatchingSubPlans() and ExecFindMatchingSubPlans(). They perform "initial" and "exec" set of pruning steps, respectively. I realized that both functions have identical bodies at this point, except that they pass 'true' and 'false', respectively, for initial_prune argument of the sub-routine find_matching_subplans_recurse(), which is where the pruning using the appropriate set of steps contained in PartitionPruneState (initial_pruning_steps or exec_pruning_steps) actually occurs. So, I've updated the patch to just retain the latter, adding an initial_prune parameter to it to pass to the aforementioned find_matching_subplans_recurse(). 
I've also updated the run-time pruning module comment to describe this change: * ExecFindMatchingSubPlans: - * Returns indexes of matching subplans after evaluating all available - * expressions, that is, using execution pruning steps. This function can - * can only be called during execution and must be called again each time - * the value of a Param listed in PartitionPruneState's 'execparamids' - * changes. + * Returns indexes of matching subplans after evaluating the expressions + * that are safe to evaluate at a given point. This function is first + * called during ExecInitPartitionPruning() to find the initially + * matching subplans based on performing the initial pruning steps and + * then must be called again each time the value of a Param listed in + * PartitionPruneState's 'execparamids' changes. > I think this comment needs to be reworded: > > + * Subplans would previously be indexed 0..(n_total_subplans - 1) should be > + * changed to index range 0..num(initially_valid_subplans). Assuming you meant to ask to write this without the odd notation, I've expanded the comment as follows: - * Subplans would previously be indexed 0..(n_total_subplans - 1) should be - * changed to index range 0..num(initially_valid_subplans). + * Current values of the indexes present in PartitionPruneState count all the + * subplans that would be present before initial pruning was done. If initial + * pruning got rid of some of the subplans, any subsequent pruning passes will + * will be looking at a different set of target subplans to choose from than + * those in the pre-initial-pruning set, so the maps in PartitionPruneState + * containing those indexes must be updated to reflect the new indexes of + * subplans in the post-initial-pruning set. I've attached only the updated 0001, though I'm still working on the others to address David's comments. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
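A simplified sketch of what the merged entry point could end up looking like after folding ExecFindInitialMatchingSubPlans() into ExecFindMatchingSubPlans(); memory-context handling and other details the real function would need are omitted:

    Bitmapset *
    ExecFindMatchingSubPlans(PartitionPruneState *prunestate,
                             bool initial_prune)
    {
        Bitmapset  *result = NULL;
        int         i;

        /* Only legal to come here for exec pruning if there are exec steps. */
        Assert(initial_prune || prunestate->do_exec_prune);

        for (i = 0; i < prunestate->num_partprunedata; i++)
        {
            PartitionPruningData *prunedata = prunestate->partprunedata[i];

            /*
             * Recurse from the topmost partitioned table of this hierarchy;
             * initial_prune selects which set of pruning steps is evaluated.
             */
            find_matching_subplans_recurse(prunedata,
                                           &prunedata->partrelprunedata[0],
                                           initial_prune, &result);
        }

        return result;
    }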
On Mon, Apr 4, 2022 at 9:55 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Sun, Apr 3, 2022 at 8:33 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I think the names ExecCreatePartitionPruneState and > > ExecInitPartitionPruning are too confusingly similar. Maybe the former > > should be renamed to somehow make it clear that it is a subroutine for > > the former. > > Ah, yes. I've taken out the "Exec" from the former. While at it, maybe it's better to rename ExecInitPruningContext() to InitPartitionPruneContext(), which I've done in the attached updated patch. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
On 2022-Apr-05, Amit Langote wrote: > While at it, maybe it's better to rename ExecInitPruningContext() to > InitPartitionPruneContext(), which I've done in the attached updated > patch. Good call. I had changed that name too, but yours seems a better choice. I made a few other cosmetic changes and pushed. I'm afraid this will cause a few conflicts with your 0004 -- hopefully these should mostly be minor. One change that's not completely cosmetic is a change in the test on whether to call PartitionPruneFixSubPlanMap or not. Originally it was: if (partprune->do_exec_prune && bms_num_members( ... )) do_stuff(); which meant that bms_num_members() is only evaluated if do_exec_prune. However, the do_exec_prune bit is an optimization (we can skip doing that stuff if it's not going to be used), but the other test is more strict: the stuff is completely irrelevant if no plans have been removed, since the data structure does not need fixing. So I changed it to be like this if (bms_num_members( .. )) { /* can skip if it's pointless */ if (do_exec_prune) do_stuff(); } I think that it is clearer to the human reader this way; and I think a smart compiler may realize that the test can be reversed and avoid counting bits when it's pointless. So your 0004 patch should add the new condition to the outer if(), since it's a critical consideration rather than an optimization: if (partprune && bms_num_members()) { /* can skip if pointless */ if (do_exec_prune) do_stuff() } Now, if we disagree and think that counting bits in the BMS when it's going to be discarded by do_exec_prune being false, then we can flip that back as originally and a more explicit comment. With no evidence, I doubt it matters. Thanks for the patch! I think the new coding is indeed a bit easier to follow. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ <inflex> really, I see PHP as like a strange amalgamation of C, Perl, Shell <crab> inflex: you know that "amalgam" means "mixture with mercury", more or less, right? <crab> i.e., "deadly poison"
On Tue, Apr 5, 2022 at 7:00 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > On 2022-Apr-05, Amit Langote wrote: > > While at it, maybe it's better to rename ExecInitPruningContext() to > > InitPartitionPruneContext(), which I've done in the attached updated > > patch. > > Good call. I had changed that name too, but yours seems a better > choice. > > I made a few other cosmetic changes and pushed. Thanks! > I'm afraid this will > cause a few conflicts with your 0004 -- hopefully these should mostly be > minor. > > One change that's not completely cosmetic is a change in the test on > whether to call PartitionPruneFixSubPlanMap or not. Originally it was: > > if (partprune->do_exec_prune && > bms_num_members( ... )) > do_stuff(); > > which meant that bms_num_members() is only evaluated if do_exec_prune. > However, the do_exec_prune bit is an optimization (we can skip doing > that stuff if it's not going to be used), but the other test is more > strict: the stuff is completely irrelevant if no plans have been > removed, since the data structure does not need fixing. So I changed it > to be like this > > if (bms_num_members( .. )) > { > /* can skip if it's pointless */ > if (do_exec_prune) > do_stuff(); > } > > I think that it is clearer to the human reader this way; and I think a > smart compiler may realize that the test can be reversed and avoid > counting bits when it's pointless. > > So your 0004 patch should add the new condition to the outer if(), since > it's a critical consideration rather than an optimization: > if (partprune && bms_num_members()) > { > /* can skip if pointless */ > if (do_exec_prune) > do_stuff() > } > > Now, if we disagree and think that counting bits in the BMS when it's > going to be discarded by do_exec_prune being false, then we can flip > that back as originally and a more explicit comment. With no evidence, > I doubt it matters. I agree that counting bits in the outer condition makes this easier to read, so see no problem with keeping it that way. Will post the rebased main patch soon, whose rewrite I'm close to being done with. -- Amit Langote EDB: http://www.enterprisedb.com
On Fri, Apr 1, 2022 at 5:36 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Fri, Apr 1, 2022 at 5:20 PM David Rowley <dgrowleyml@gmail.com> wrote: > > On Fri, 1 Apr 2022 at 19:58, Amit Langote <amitlangote09@gmail.com> wrote: > > > Yes, the ExecLockRelsInfo node in the current patch, that first gets > > > added to the QueryDesc and subsequently to the EState of the query, > > > serves as that stashing place. Not sure if you've looked at > > > ExecLockRelInfo in detail in your review of the patch so far, but it > > > carries the initial pruning result in what are called > > > PlanInitPruningOutput nodes, which are stored in a list in > > > ExecLockRelsInfo and their offsets in the list are in turn stored in > > > an adjacent array that contains an element for every plan node in the > > > tree. If we go with a PlannedStmt.partpruneinfos list, then maybe we > > > don't need to have that array, because the Append/MergeAppend nodes > > > would be carrying those offsets by themselves. > > > > I saw it, just not in great detail. I saw that you had an array that > > was indexed by the plan node's ID. I thought that wouldn't be so good > > with large complex plans that we often get with partitioning > > workloads. That's why I mentioned using another index that you store > > in Append/MergeAppend that starts at 0 and increments by 1 for each > > node that has a PartitionPruneInfo made for it during create_plan. > > > > > Maybe a different name for ExecLockRelsInfo would be better? > > > > > > Also, given Tom's apparent dislike for carrying that in PlannedStmt, > > > maybe the way I have it now is fine? > > > > I think if you change how it's indexed and the other stuff then we can > > have another look. I think the patch will be much easier to review > > once the ParitionPruneInfos are moved into PlannedStmt. > > Will do, thanks. And here is a version like that that passes make check-world. Maybe still a WIP as I think comments could use more editing. Here's how the new implementation works: AcquireExecutorLocks() calls ExecutorDoInitialPruning(), which in turn iterates over a list of PartitionPruneInfos in a given PlannedStmt coming from a CachedPlan. For each PartitionPruneInfo, ExecPartitionDoInitialPruning() is called, which sets up PartitionPruneState and performs initial pruning steps present in the PartitionPruneInfo. The resulting bitmapsets of valid subplans, one for each PartitionPruneInfo, are collected in a list and added to a result node called PartitionPruneResult. It represents the result of performing initial pruning on all PartitionPruneInfos found in a plan. A list of PartitionPruneResults is passed along with the PlannedStmt to the executor, which is referenced when initializing Append/MergeAppend nodes. PlannedStmt.minLockRelids defined by the planner contains the RT indexes of all the entries in the range table minus those of the leaf partitions whose subplans are subject to removal due to initial pruning. AcquireExecutoLocks() adds back the RT indexes of only those leaf partitions whose subplans survive ExecutorDoInitialPruning(). To get the leaf partition RT indexes from the PartitionPruneInfo, a new rti_map array is added to PartitionedRelPruneInfo. There's only one patch this time. Patches that added partitioned_rels and plan_tree_walker() are no longer necessary. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
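As a reading aid, roughly what the AcquireExecutorLocks() side of that flow might look like per PlannedStmt; the loop shape and names such as survived_leaf_relids are guesses for illustration, not the patch's exact code, and declarations of the surrounding variables are omitted:

    PlannedStmt *plannedstmt = lfirst_node(PlannedStmt, lc);
    PartitionPruneResult *pruneresult;
    Bitmapset  *lockrelids;
    int         rti;

    /* Perform the initial pruning steps of every PartitionPruneInfo. */
    pruneresult = ExecutorDoInitialPruning(plannedstmt);

    /*
     * Lock the planner-computed minimal set plus the leaf partitions whose
     * subplans survived initial pruning (survived_leaf_relids is a
     * hypothetical field; the patch derives these RTIs via rti_map).
     */
    lockrelids = bms_union(plannedstmt->minLockRelids,
                           pruneresult->survived_leaf_relids);

    rti = -1;
    while ((rti = bms_next_member(lockrelids, rti)) >= 0)
    {
        RangeTblEntry *rte = rt_fetch(rti, plannedstmt->rtable);

        if (rte->rtekind == RTE_RELATION)
            LockRelationOid(rte->relid, rte->rellockmode);
    }

    /* Remember the pruning results to hand to the executor later. */
    part_prune_results = lappend(part_prune_results, pruneresult);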
On Wed, Apr 6, 2022 at 4:20 PM Amit Langote <amitlangote09@gmail.com> wrote: > And here is a version like that that passes make check-world. Maybe > still a WIP as I think comments could use more editing. > > Here's how the new implementation works: > > AcquireExecutorLocks() calls ExecutorDoInitialPruning(), which in turn > iterates over a list of PartitionPruneInfos in a given PlannedStmt > coming from a CachedPlan. For each PartitionPruneInfo, > ExecPartitionDoInitialPruning() is called, which sets up > PartitionPruneState and performs initial pruning steps present in the > PartitionPruneInfo. The resulting bitmapsets of valid subplans, one > for each PartitionPruneInfo, are collected in a list and added to a > result node called PartitionPruneResult. It represents the result of > performing initial pruning on all PartitionPruneInfos found in a plan. > A list of PartitionPruneResults is passed along with the PlannedStmt > to the executor, which is referenced when initializing > Append/MergeAppend nodes. > > PlannedStmt.minLockRelids defined by the planner contains the RT > indexes of all the entries in the range table minus those of the leaf > partitions whose subplans are subject to removal due to initial > pruning. AcquireExecutoLocks() adds back the RT indexes of only those > leaf partitions whose subplans survive ExecutorDoInitialPruning(). To > get the leaf partition RT indexes from the PartitionPruneInfo, a new > rti_map array is added to PartitionedRelPruneInfo. > > There's only one patch this time. Patches that added partitioned_rels > and plan_tree_walker() are no longer necessary. Here's an updated version. In particular, I removed the part_prune_results list from PortalData, in favor of having anything that needs to look at the list instead get it from the CachedPlan (PortalData.cplan). This makes things better in 2 ways: * All the changes that were needed to produce the list to be passed to PortalDefineQuery() are now unnecessary (especially ugly ones were those made to pg_plan_queries()'s interface) * The cases in which the PartitionPruneResult being added to a QueryDesc can be assumed to be valid are more clearly defined now; it's the cases where the portal's CachedPlan is also valid, that is, if the accompanying PlannedStmt is a cached one. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Thu, 7 Apr 2022 at 20:28, Amit Langote <amitlangote09@gmail.com> wrote: > Here's an updated version. In Particular, I removed > part_prune_results list from PortalData, in favor of anything that > needs to look at the list can instead get it from the CachedPlan > (PortalData.cplan). This makes things better in 2 ways: Thanks for making those changes. I'm not overly familiar with the data structures we use for planning around plans between the planner and executor, but storing the pruning results in CachedPlan seems pretty bad. I see you've stashed it in there and invented a new memory context to stop leaks into the cache memory. Since I'm not overly familiar with these structures, I'm trying to imagine why you made that choice and the best I can come up with was that it was the most convenient thing you had to hand inside CheckCachedPlan(). I don't really have any great ideas right now on how to make this better. I wonder if GetCachedPlan() should be changed to return some struct that wraps up the CachedPlan with some sort of executor prep info struct that we can stash the list of PartitionPruneResults in, and perhaps something else one day. Some lesser important stuff that I think could be done better. * Are you also able to put meaningful comments on the PartitionPruneResult struct in execnodes.h? * In create_append_plan() and create_merge_append_plan() you have the same code to set the part_prune_index. Why not just move all that code into make_partition_pruneinfo() and have make_partition_pruneinfo() return the index and append to the PlannerInfo.partPruneInfos List? * Why not forboth() here? i = 0; foreach(stmtlist_item, portal->stmts) { PlannedStmt *pstmt = lfirst_node(PlannedStmt, stmtlist_item); PartitionPruneResult *part_prune_result = part_prune_results ? list_nth(part_prune_results, i) : NULL; i++; * It would be good if ReleaseExecutorLocks() already knew the RTIs that were locked. Maybe the executor prep info struct I mentioned above could also store the RTIs that have been locked already and allow ReleaseExecutorLocks() to just iterate over those to release the locks. David
On Thu, Apr 7, 2022 at 9:41 PM David Rowley <dgrowleyml@gmail.com> wrote: > On Thu, 7 Apr 2022 at 20:28, Amit Langote <amitlangote09@gmail.com> wrote: > > Here's an updated version. In Particular, I removed > > part_prune_results list from PortalData, in favor of anything that > > needs to look at the list can instead get it from the CachedPlan > > (PortalData.cplan). This makes things better in 2 ways: > > Thanks for making those changes. > > I'm not overly familiar with the data structures we use for planning > around plans between the planner and executor, but storing the pruning > results in CachedPlan seems pretty bad. I see you've stashed it in > there and invented a new memory context to stop leaks into the cache > memory. > > Since I'm not overly familiar with these structures, I'm trying to > imagine why you made that choice and the best I can come up with was > that it was the most convenient thing you had to hand inside > CheckCachedPlan(). Yeah, it's that way because it felt convenient, though I have wondered if a simpler scheme that doesn't require any changes to the CachedPlan data structure might be better after all. Your pointing it out has made me think a bit harder on that. > I don't really have any great ideas right now on how to make this > better. I wonder if GetCachedPlan() should be changed to return some > struct that wraps up the CachedPlan with some sort of executor prep > info struct that we can stash the list of PartitionPruneResults in, > and perhaps something else one day. I think what might be better to do now is just add an output List parameter to GetCachedPlan() to add the PartitionPruneResult node to instead of stashing them into CachedPlan as now. IMHO, we should leave inventing a new generic struct to the next project that will make it necessary to return more information from GetCachedPlan() to its users. I find it hard to convincingly describe what the new generic struct really is if we invent it *now*, when it's going to carry a single list whose purpose is pretty narrow. So, I've implemented this by making the callers of GetCachedPlan() pass a list to add the PartitionPruneResults that may be produced. Most callers can put that into the Portal for passing that to other modules, so I have reinstated PortalData.part_prune_results. As for its memory management, the list and the PartitionPruneResults therein will be allocated in a context that holds the Portal itself. > Some lesser important stuff that I think could be done better. > > * Are you also able to put meaningful comments on the > PartitionPruneResult struct in execnodes.h? > > * In create_append_plan() and create_merge_append_plan() you have the > same code to set the part_prune_index. Why not just move all that code > into make_partition_pruneinfo() and have make_partition_pruneinfo() > return the index and append to the PlannerInfo.partPruneInfos List? That sounds better, so done. > * Why not forboth() here? > > i = 0; > foreach(stmtlist_item, portal->stmts) > { > PlannedStmt *pstmt = lfirst_node(PlannedStmt, stmtlist_item); > PartitionPruneResult *part_prune_result = part_prune_results ? > list_nth(part_prune_results, i) : > NULL; > > i++; Because the PartitionPruneResult list may not always be available. To wit, it's only available when it is GetCachedPlan() that gave the portal its plan. I know this is a bit ugly, but it seems better than fixing all users of Portal to build a dummy list, not that it is totally avoidable even in the current implementation. 
> * It would be good if ReleaseExecutorLocks() already knew the RTIs > that were locked. Maybe the executor prep info struct I mentioned > above could also store the RTIs that have been locked already and > allow ReleaseExecutorLocks() to just iterate over those to release the > locks. Rewrote this such that ReleaseExecutorLocks() just receives a list of per-PlannedStmt bitmapsets containing the RT indexes of only the locked entries in that plan. Attached updated patch with these changes. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
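For example, the release side could then be as simple as the following sketch, where the list of per-statement locked-RTI bitmapsets is the new input (variable names here are illustrative, not necessarily the patch's):

    static void
    ReleaseExecutorLocks(List *stmt_list, List *lockedRelids_per_stmt)
    {
        ListCell   *lc1,
                   *lc2;

        forboth(lc1, stmt_list, lc2, lockedRelids_per_stmt)
        {
            PlannedStmt *plannedstmt = lfirst_node(PlannedStmt, lc1);
            Bitmapset   *lockedRelids = (Bitmapset *) lfirst(lc2);
            int          rti = -1;

            /* Unlock exactly the relations that AcquireExecutorLocks() locked. */
            while ((rti = bms_next_member(lockedRelids, rti)) >= 0)
            {
                RangeTblEntry *rte = rt_fetch(rti, plannedstmt->rtable);

                UnlockRelationOid(rte->relid, rte->rellockmode);
            }
        }
    }

Compared with re-walking the plan trees, this visits only the RT indexes that were actually locked.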
On Fri, 8 Apr 2022 at 17:49, Amit Langote <amitlangote09@gmail.com> wrote: > Attached updated patch with these changes. Thanks for making the changes. I started looking over this patch but really feel like it needs quite a few more iterations of what we've just been doing to get it into proper committable shape. There seems to be only about 40 mins to go before the freeze, so it seems very unrealistic that it could be made to work. I started trying to take a serious look at it this evening, but I feel like I just failed to get into it deep enough to make any meaningful improvements. I'd need more time to study the problem before I could build up a proper opinion on how exactly I think it should work. Anyway. I've attached a small patch that's just a few things I adjusted or questions while reading over your v13 patch. Some of these are just me questioning your code (See XXX comments) and some I think are improvements. Feel free to take the hunks that you see fit and drop anything you don't. David
Attachment
Hi David, On Fri, Apr 8, 2022 at 8:16 PM David Rowley <dgrowleyml@gmail.com> wrote: > On Fri, 8 Apr 2022 at 17:49, Amit Langote <amitlangote09@gmail.com> wrote: > > Attached updated patch with these changes. > Thanks for making the changes. I started looking over this patch but > really feel like it needs quite a few more iterations of what we've > just been doing to get it into proper committable shape. There seems > to be only about 40 mins to go before the freeze, so it seems very > unrealistic that it could be made to work. Yeah, totally understandable. > I started trying to take a serious look at it this evening, but I feel > like I just failed to get into it deep enough to make any meaningful > improvements. I'd need more time to study the problem before I could > build up a proper opinion on how exactly I think it should work. > > Anyway. I've attached a small patch that's just a few things I > adjusted or questions while reading over your v13 patch. Some of > these are just me questioning your code (See XXX comments) and some I > think are improvements. Feel free to take the hunks that you see fit > and drop anything you don't. Thanks a lot for compiling those. Most looked fine changes to me except a couple of typos, so I've adopted those into the attached new version, even though I know it's too late to try to apply it. Re the XXX comments: + /* XXX why would pprune->rti_map[i] ever be zero here??? */ Yeah, no there can't be, was perhaps being overly paraioid. + * XXX is it worth doing a bms_copy() on glob->minLockRelids if + * glob->containsInitialPruning is true?. I'm slighly worried that the + * Bitmapset could have a very long empty tail resulting in excessive + * looping during AcquireExecutorLocks(). + */ I guess I trust your instincts about bitmapset operation efficiency and what you've written here makes sense. It's typical for leaf partitions to have been appended toward the tail end of rtable and I'd imagine their indexes would be in the tail words of minLockRelids. If copying the bitmapset removes those useless words, I don't see why we shouldn't do that. So added: + /* + * It seems worth doing a bms_copy() on glob->minLockRelids if we deleted + * bit from it just above to prevent empty tail bits resulting in + * inefficient looping during AcquireExecutorLocks(). + */ + if (glob->containsInitialPruning) + glob->minLockRelids = bms_copy(glob->minLockRelids) Not 100% about the comment I wrote. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Fri, Apr 8, 2022 at 8:45 PM Amit Langote <amitlangote09@gmail.com> wrote: > Most looked fine changes to me except a couple of typos, so I've > adopted those into the attached new version, even though I know it's > too late to try to apply it. > > + * XXX is it worth doing a bms_copy() on glob->minLockRelids if > + * glob->containsInitialPruning is true?. I'm slighly worried that the > + * Bitmapset could have a very long empty tail resulting in excessive > + * looping during AcquireExecutorLocks(). > + */ > > I guess I trust your instincts about bitmapset operation efficiency > and what you've written here makes sense. It's typical for leaf > partitions to have been appended toward the tail end of rtable and I'd > imagine their indexes would be in the tail words of minLockRelids. If > copying the bitmapset removes those useless words, I don't see why we > shouldn't do that. So added: > > + /* > + * It seems worth doing a bms_copy() on glob->minLockRelids if we deleted > + * bit from it just above to prevent empty tail bits resulting in > + * inefficient looping during AcquireExecutorLocks(). > + */ > + if (glob->containsInitialPruning) > + glob->minLockRelids = bms_copy(glob->minLockRelids) > > Not 100% about the comment I wrote. And the quoted code change missed a semicolon in the v14 that I hurriedly sent on Friday. (Had apparently forgotten to `git add` the hunk to fix that). Sending v15 that fixes that to keep the cfbot green for now. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Sun, Apr 10, 2022 at 8:05 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Fri, Apr 8, 2022 at 8:45 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Most looked fine changes to me except a couple of typos, so I've
> adopted those into the attached new version, even though I know it's
> too late to try to apply it.
>
> + * XXX is it worth doing a bms_copy() on glob->minLockRelids if
> + * glob->containsInitialPruning is true?. I'm slighly worried that the
> + * Bitmapset could have a very long empty tail resulting in excessive
> + * looping during AcquireExecutorLocks().
> + */
>
> I guess I trust your instincts about bitmapset operation efficiency
> and what you've written here makes sense. It's typical for leaf
> partitions to have been appended toward the tail end of rtable and I'd
> imagine their indexes would be in the tail words of minLockRelids. If
> copying the bitmapset removes those useless words, I don't see why we
> shouldn't do that. So added:
>
> + /*
> + * It seems worth doing a bms_copy() on glob->minLockRelids if we deleted
> + * bit from it just above to prevent empty tail bits resulting in
> + * inefficient looping during AcquireExecutorLocks().
> + */
> + if (glob->containsInitialPruning)
> + glob->minLockRelids = bms_copy(glob->minLockRelids)
>
> Not 100% about the comment I wrote.
And the quoted code change missed a semicolon in the v14 that I
hurriedly sent on Friday. (Had apparently forgotten to `git add` the
hunk to fix that).
Sending v15 that fixes that to keep the cfbot green for now.
--
Amit Langote
EDB: http://www.enterprisedb.com
Hi,
+ /* RT index of the partitione table. */
partitione -> partitioned
Cheers
On Mon, Apr 11, 2022 at 12:53 PM Zhihong Yu <zyu@yugabyte.com> wrote: > On Sun, Apr 10, 2022 at 8:05 PM Amit Langote <amitlangote09@gmail.com> wrote: >> Sending v15 that fixes that to keep the cfbot green for now. > > Hi, > > + /* RT index of the partitione table. */ > > partitione -> partitioned Thanks, fixed. Also, I broke this into patches: 0001 contains the mechanical changes of moving PartitionPruneInfo out of Append/MergeAppend into a list in PlannedStmt. 0002 is the main patch to "Optimize AcquireExecutorLocks() by locking only unpruned partitions". -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Fri, May 27, 2022 at 1:10 AM Amit Langote <amitlangote09@gmail.com> wrote:
On Mon, Apr 11, 2022 at 12:53 PM Zhihong Yu <zyu@yugabyte.com> wrote:
> On Sun, Apr 10, 2022 at 8:05 PM Amit Langote <amitlangote09@gmail.com> wrote:
>> Sending v15 that fixes that to keep the cfbot green for now.
>
> Hi,
>
> + /* RT index of the partitione table. */
>
> partitione -> partitioned
Thanks, fixed.
Also, I broke this into patches:
0001 contains the mechanical changes of moving PartitionPruneInfo out
of Append/MergeAppend into a list in PlannedStmt.
0002 is the main patch to "Optimize AcquireExecutorLocks() by locking
only unpruned partitions".
--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
Hi,
In the description:
PartitionPruneResult, made available along with the PlannedStmt by the
I think the second `made available` is redundant (can be omitted).
+ * Initial pruning is performed here if needed (unless it has already been done
+ * by ExecDoInitialPruning()), and in that case only the surviving subplans'
I wonder if there is a typo above - I don't find ExecDoInitialPruning either in PG codebase or in the patches (except for this in the comment).
I think it should be ExecutorDoInitialPruning.
+ * bit from it just above to prevent empty tail bits resulting in
I searched the code base but didn't find any mention of `empty tail bit`. Do you mind explaining a bit about it?
Cheers
On Fri, May 27, 2022 at 1:09 AM Amit Langote <amitlangote09@gmail.com> wrote: > 0001 contains the mechanical changes of moving PartitionPruneInfo out > of Append/MergeAppend into a list in PlannedStmt. > > 0002 is the main patch to "Optimize AcquireExecutorLocks() by locking > only unpruned partitions". This patchset will need to be rebased over 835d476fd21; looks like just a cosmetic change. --Jacob
On Wed, Jul 6, 2022 at 2:43 AM Jacob Champion <jchampion@timescale.com> wrote: > On Fri, May 27, 2022 at 1:09 AM Amit Langote <amitlangote09@gmail.com> wrote: > > 0001 contains the mechanical changes of moving PartitionPruneInfo out > > of Append/MergeAppend into a list in PlannedStmt. > > > > 0002 is the main patch to "Optimize AcquireExecutorLocks() by locking > > only unpruned partitions". > > This patchset will need to be rebased over 835d476fd21; looks like > just a cosmetic change. Thanks for the heads up. Rebased and also fixed per comments given by Zhihong Yu on May 28. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
Rebased over 964d01ae90c. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Wed, Jul 13, 2022 at 3:40 PM Amit Langote <amitlangote09@gmail.com> wrote: > Rebased over 964d01ae90c. Sorry, left some pointless hunks in there while rebasing. Fixed in the attached. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Wed, Jul 13, 2022 at 4:03 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Wed, Jul 13, 2022 at 3:40 PM Amit Langote <amitlangote09@gmail.com> wrote: > > Rebased over 964d01ae90c. > > Sorry, left some pointless hunks in there while rebasing. Fixed in > the attached. Needed to be rebased again, over 2d04277121f this time. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Tue, Jul 26, 2022 at 11:01 PM Amit Langote <amitlangote09@gmail.com> wrote: > Needed to be rebased again, over 2d04277121f this time. 0001 adds es_part_prune_result but does not use it, so maybe the introduction of that field should be deferred until it's needed for something. I wonder whether it's really necessary to add the PartitionPruneInfo objects to a list in PlannerInfo first and then roll them up into PlannerGlobal later. I know we do that for range table entries, but I've never quite understood why we do it that way instead of creating a flat range table in PlannerGlobal from the start. And so by extension I wonder whether this table couldn't be flat from the start also. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jul 26, 2022 at 11:01 PM Amit Langote <amitlangote09@gmail.com> wrote: > > Needed to be rebased again, over 2d04277121f this time. Thanks for looking. > 0001 adds es_part_prune_result but does not use it, so maybe the > introduction of that field should be deferred until it's needed for > something. Oops, looks like a mistake when breaking the patch. Will move that bit to 0002. > I wonder whether it's really necessary to added the PartitionPruneInfo > objects to a list in PlannerInfo first and then roll them up into > PlannerGlobal later. I know we do that for range table entries, but > I've never quite understood why we do it that way instead of creating > a flat range table in PlannerGlobal from the start. And so by > extension I wonder whether this table couldn't be flat from the start > also. Tom may want to correct me but my understanding of why the planner waits till the end of planning to start populating the PlannerGlobal range table is that it is not until then that we know which subqueries will be scanned by the final plan tree, so also whose range table entries will be included in the range table passed to the executor. I can see that subquery pull-up causes a pulled-up subquery's range table entries to be added into the parent's query's and all its nodes changed using OffsetVarNodes() to refer to the new RT indexes. But for subqueries that are not pulled up, their subplans' nodes (present in PlannerGlboal.subplans) would still refer to the original RT indexes (per range table in the corresponding PlannerGlobal.subroot), which must be fixed and the end of planning is the time to do so. Or maybe that could be done when build_subplan() creates a subplan and adds it to PlannerGlobal.subplans, but for some reason it's not? -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Amit Langote <amitlangote09@gmail.com> writes: > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote: >> I wonder whether it's really necessary to added the PartitionPruneInfo >> objects to a list in PlannerInfo first and then roll them up into >> PlannerGlobal later. I know we do that for range table entries, but >> I've never quite understood why we do it that way instead of creating >> a flat range table in PlannerGlobal from the start. And so by >> extension I wonder whether this table couldn't be flat from the start >> also. > Tom may want to correct me but my understanding of why the planner > waits till the end of planning to start populating the PlannerGlobal > range table is that it is not until then that we know which subqueries > will be scanned by the final plan tree, so also whose range table > entries will be included in the range table passed to the executor. It would not be profitable to flatten the range table before we've done remove_useless_joins. We'd end up with useless entries from subqueries that ultimately aren't there. We could perhaps do it after we finish that phase, but I don't really see the point: it wouldn't be better than what we do now, just the same work at a different time. regards, tom lane
On Fri, Jul 29, 2022 at 12:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > It would not be profitable to flatten the range table before we've > done remove_useless_joins. We'd end up with useless entries from > subqueries that ultimately aren't there. We could perhaps do it > after we finish that phase, but I don't really see the point: it > wouldn't be better than what we do now, just the same work at a > different time. That's not quite my question, though. Why do we ever build a non-flat range table in the first place? Like, instead of assigning indexes relative to the current subquery level, why not just assign them relative to the whole query from the start? It can't really be that we've done it this way because of remove_useless_joins(), because we've been building separate range tables and later flattening them for longer than join removal has existed as a feature. What bugs me is that it's very much not free. By building a bunch of separate range tables and combining them later, we generate extra work: we have to go back and adjust RT indexes after-the-fact. We pay that overhead for every query, not just the ones that end up with some unused entries in the range table. And why would it matter if we did end up with some useless entries in the range table, anyway? If there's some semantic difference, we could add a flag to mark those entries as needing to be ignored, which seems way better than crawling all over the whole tree adjusting RTIs everywhere. I don't really expect that we're ever going to change this -- and certainly not on this thread. The idea of running around and replacing RT indexes all over the tree is deeply embedded in the system. But are we really sure we want to add a second kind of index that we have to run around and adjust at the same time? If we are, so be it, I guess. It just looks really ugly and unnecessary to me. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > That's not quite my question, though. Why do we ever build a non-flat > range table in the first place? Like, instead of assigning indexes > relative to the current subquery level, why not just assign them > relative to the whole query from the start? We could probably make that work, but I'm skeptical that it would really be an improvement overall, for a couple of reasons. (1) The need for merge-rangetables-and-renumber-Vars logic doesn't go away. It just moves from setrefs.c to the rewriter, which would have to do it when expanding views. This would be a net loss performance-wise, I think, because setrefs.c can do it as part of a parsetree scan that it has to perform anyway for other housekeeping reasons; but the rewriter would need a brand new pass over the tree. Admittedly that pass would only happen for view replacement, but it's still not open-and-shut that there'd be a performance win. (2) The need for varlevelsup and similar fields doesn't go away, I think, because we need those for semantic purposes such as discovering the query level that aggregates are associated with. That means that subquery flattening still has to make a pass over the tree to touch every Var's varlevelsup; so not having to adjust varno at the same time would save little. I'm not sure whether I think it's a net plus or net minus that varno would become effectively independent of varlevelsup. It'd be different from the way we think of them now, for sure, and I think it'd take awhile to flush out bugs arising from such a redefinition. > I don't really expect that we're ever going to change this -- and > certainly not on this thread. The idea of running around and replacing > RT indexes all over the tree is deeply embedded in the system. But are > we really sure we want to add a second kind of index that we have to > run around and adjust at the same time? You probably want to avert your eyes from [1], then ;-). Although I'm far from convinced that the cross-list index fields currently proposed there are actually necessary; the cost to adjust them during rangetable merging could outweigh any benefit. regards, tom lane [1] https://www.postgresql.org/message-id/flat/CA+HiwqGjJDmUhDSfv-U2qhKJjt9ST7Xh9JXC_irsAQ1TAUsJYg@mail.gmail.com
On Fri, Jul 29, 2022 at 11:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > We could probably make that work, but I'm skeptical that it would > really be an improvement overall, for a couple of reasons. > > (1) The need for merge-rangetables-and-renumber-Vars logic doesn't > go away. It just moves from setrefs.c to the rewriter, which would > have to do it when expanding views. This would be a net loss > performance-wise, I think, because setrefs.c can do it as part of a > parsetree scan that it has to perform anyway for other housekeeping > reasons; but the rewriter would need a brand new pass over the tree. > Admittedly that pass would only happen for view replacement, but > it's still not open-and-shut that there'd be a performance win. > > (2) The need for varlevelsup and similar fields doesn't go away, > I think, because we need those for semantic purposes such as > discovering the query level that aggregates are associated with. > That means that subquery flattening still has to make a pass over > the tree to touch every Var's varlevelsup; so not having to adjust > varno at the same time would save little. > > I'm not sure whether I think it's a net plus or net minus that > varno would become effectively independent of varlevelsup. > It'd be different from the way we think of them now, for sure, > and I think it'd take awhile to flush out bugs arising from such > a redefinition. Interesting. Thanks for your thoughts. I guess it's not as clear-cut as I thought, but I still can't help feeling like we're doing an awful lot of expensive rearrangement at the end of query planning. I kind of wonder whether varlevelsup is the wrong idea. Like, suppose we instead handed out subquery identifiers serially, sort of like what we do with SubTransactionId values. Then instead of testing whether varlevelsup>0 you test whether varsubqueryid==mysubqueryid. If you flatten a query into its parent, you still need to adjust every var that refers to the dead subquery, but you don't need to adjust vars that refer to subqueries underneath it. Their level changes, but their identity doesn't. Maybe that doesn't really help that much, but it's always struck me as a little unfortunate that we basically test whether a var is equal by testing whether the varno and varlevelsup are equal. That only works if you assume that you can never end up comparing two vars from thoroughly unrelated parts of the tree, such that the subquery one level up from one might be different from the subquery one level up from the other. > > I don't really expect that we're ever going to change this -- and > > certainly not on this thread. The idea of running around and replacing > > RT indexes all over the tree is deeply embedded in the system. But are > > we really sure we want to add a second kind of index that we have to > > run around and adjust at the same time? > > You probably want to avert your eyes from [1], then ;-). Although > I'm far from convinced that the cross-list index fields currently > proposed there are actually necessary; the cost to adjust them > during rangetable merging could outweigh any benefit. I really like the idea of that patch overall, actually; I think permissions checking is a good example of something that shouldn't require walking the whole query tree but currently does. And actually, I think the same thing is true here: we shouldn't need to walk the whole query tree to find the pruning information, but right now we do. 
I'm just uncertain whether what Amit has implemented is the least-annoying way to go about it... any thoughts on that, specifically as it pertains to this patch? -- Robert Haas EDB: http://www.enterprisedb.com
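For illustration, the serial-subquery-identifier idea might look roughly
like the sketch below; the names (HypotheticalVar, varsubqueryid,
mysubqueryid) are invented for this example and exist nowhere in
PostgreSQL:

    #include <stdbool.h>

    /*
     * Hypothetical Var layout: each subquery gets a serially assigned
     * identifier, so flattening a subquery into its parent only needs to
     * touch Vars that referred to the now-dead subquery; Vars referring
     * to deeper subqueries keep their identity.
     */
    typedef struct HypotheticalVar
    {
        int         varno;          /* range-table index, as today */
        int         varsubqueryid;  /* serial id of the owning subquery */
    } HypotheticalVar;

    /* Would replace the "varlevelsup == 0" test while processing a subquery. */
    static inline bool
    var_is_local(const HypotheticalVar *var, int mysubqueryid)
    {
        return var->varsubqueryid == mysubqueryid;
    }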
Robert Haas <robertmhaas@gmail.com> writes: > ... it's > always struck me as a little unfortunate that we basically test > whether a var is equal by testing whether the varno and varlevelsup > are equal. That only works if you assume that you can never end up > comparing two vars from thoroughly unrelated parts of the tree, such > that the subquery one level up from one might be different from the > subquery one level up from the other. Yeah, that's always bothered me a little as well. I've yet to see a case where it causes a problem in practice. But I think that if, say, we were to try to do any sort of cross-query-level optimization, then the ambiguity could rise up to bite us. That might be a situation where a flat rangetable would be worth the trouble. > I'm just uncertain whether what Amit has implemented is the > least-annoying way to go about it... any thoughts on that, > specifically as it pertains to this patch? I haven't looked at this patch at all. I'll try to make some time for it, but probably not today. regards, tom lane
On Fri, Jul 29, 2022 at 12:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I'm just uncertain whether what Amit has implemented is the > > least-annoying way to go about it... any thoughts on that, > > specifically as it pertains to this patch? > > I haven't looked at this patch at all. I'll try to make some > time for it, but probably not today. OK, thanks. The preliminary patch I'm talking about here is pretty short, so you could probably look at that part of it, at least, in some relatively small amount of time. And I think it's also in pretty reasonable shape apart from this issue. But, as usual, there's the question of how well one can evaluate a preliminary patch without reviewing the full patch in detail. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jul 29, 2022 at 1:20 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > > 0001 adds es_part_prune_result but does not use it, so maybe the > > introduction of that field should be deferred until it's needed for > > something. > > Oops, looks like a mistake when breaking the patch. Will move that bit to 0002. Fixed that and also noticed that I had defined PartitionPruneResult in the wrong header (execnodes.h). That led to PartitionPruneResult nodes not being able to be written and read, because src/backend/nodes/gen_node_support.pl doesn't create _out* and _read* routines for the nodes defined in execnodes.h. I moved its definition to plannodes.h, even though it is not actually the planner that instantiates those; no other include/nodes header sounds better. One more thing I realized is that Bitmapsets added to the List PartitionPruneResult.valid_subplan_offs_list are not actually read/write-able. That's a problem that I also faced in [1], so I proposed a patch there to make Bitmapset a read/write-able Node and mark (only) the Bitmapsets that are added into read/write-able node trees with the corresponding NodeTag. I'm including that patch here as well (0002) for the main patch to work (pass -DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make sense to discuss it in its own thread? -- Thanks, Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/CA%2BHiwqH80qX1ZLx3HyHmBrOzLQeuKuGx6FzGep0F_9zw9L4PAA%40mail.gmail.com
On Wed, Oct 12, 2022 at 4:36 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Fri, Jul 29, 2022 at 1:20 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > 0001 adds es_part_prune_result but does not use it, so maybe the > > > introduction of that field should be deferred until it's needed for > > > something. > > > > Oops, looks like a mistake when breaking the patch. Will move that bit to 0002. > > Fixed that and also noticed that I had defined PartitionPruneResult in > the wrong header (execnodes.h). That led to PartitionPruneResult > nodes not being able to be written and read, because > src/backend/nodes/gen_node_support.pl doesn't create _out* and _read* > routines for the nodes defined in execnodes.h. I moved its definition > to plannodes.h, even though it is not actually the planner that > instantiates those; no other include/nodes header sounds better. > > One more thing I realized is that Bitmapsets added to the List > PartitionPruneResult.valid_subplan_offs_list are not actually > read/write-able. That's a problem that I also faced in [1], so I > proposed a patch there to make Bitmapset a read/write-able Node and > mark (only) the Bitmapsets that are added into read/write-able node > trees with the corresponding NodeTag. I'm including that patch here > as well (0002) for the main patch to work (pass > -DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make sense > to discuss it in its own thread? Had second thoughts on the use of List of Bitmapsets for this, such that the make-Bitmapset-Nodes patch is no longer needed. I had defined PartitionPruneResult such that it stood for the results of pruning for all PartitionPruneInfos contained in PlannedStmt.partPruneInfos (covering all Append/MergeAppend nodes that can use partition pruning in a given plan). So, it had a List of Bitmapset. I think it's perhaps better for PartitionPruneResult to cover only one PartitionPruneInfo and thus need only a Bitmapset and not a List thereof, which I have implemented in the attached updated patch 0002. So, instead of needing to pass around a PartitionPruneResult with each PlannedStmt, this now passes a List of PartitionPruneResult with an entry for each in PlannedStmt.partPruneInfos. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
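As a rough sketch, such a per-PartitionPruneInfo result node might look
like this; the field name follows the wording above, but the layout in
the posted patch may well differ:

    #include "nodes/nodes.h"
    #include "nodes/bitmapset.h"

    /*
     * One result per PartitionPruneInfo: just the set of sub-plan offsets
     * that survived "initial" pruning, instead of a List of Bitmapsets.
     */
    typedef struct PartitionPruneResult
    {
        NodeTag     type;
        Bitmapset  *valid_subplan_offs;     /* surviving sub-plan offsets */
    } PartitionPruneResult;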
On Mon, Oct 17, 2022 at 6:29 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Wed, Oct 12, 2022 at 4:36 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Fri, Jul 29, 2022 at 1:20 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > 0001 adds es_part_prune_result but does not use it, so maybe the > > > > introduction of that field should be deferred until it's needed for > > > > something. > > > > > > Oops, looks like a mistake when breaking the patch. Will move that bit to 0002. > > > > Fixed that and also noticed that I had defined PartitionPruneResult in > > the wrong header (execnodes.h). That led to PartitionPruneResult > > nodes not being able to be written and read, because > > src/backend/nodes/gen_node_support.pl doesn't create _out* and _read* > > routines for the nodes defined in execnodes.h. I moved its definition > > to plannodes.h, even though it is not actually the planner that > > instantiates those; no other include/nodes header sounds better. > > > > One more thing I realized is that Bitmapsets added to the List > > PartitionPruneResult.valid_subplan_offs_list are not actually > > read/write-able. That's a problem that I also faced in [1], so I > > proposed a patch there to make Bitmapset a read/write-able Node and > > mark (only) the Bitmapsets that are added into read/write-able node > > trees with the corresponding NodeTag. I'm including that patch here > > as well (0002) for the main patch to work (pass > > -DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make sense > > to discuss it in its own thread? > > Had second thoughts on the use of List of Bitmapsets for this, such > that the make-Bitmapset-Nodes patch is no longer needed. > > I had defined PartitionPruneResult such that it stood for the results > of pruning for all PartitionPruneInfos contained in > PlannedStmt.partPruneInfos (covering all Append/MergeAppend nodes that > can use partition pruning in a given plan). So, it had a List of > Bitmapset. I think it's perhaps better for PartitionPruneResult to > cover only one PartitionPruneInfo and thus need only a Bitmapset and > not a List thereof, which I have implemented in the attached updated > patch 0002. So, instead of needing to pass around a > PartitionPruneResult with each PlannedStmt, this now passes a List of > PartitionPruneResult with an entry for each in > PlannedStmt.partPruneInfos. Rebased over 3b2db22fe. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
On Thu, Oct 27, 2022 at 11:41 AM Amit Langote <amitlangote09@gmail.com> wrote: > On Mon, Oct 17, 2022 at 6:29 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Wed, Oct 12, 2022 at 4:36 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > On Fri, Jul 29, 2022 at 1:20 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > > 0001 adds es_part_prune_result but does not use it, so maybe the > > > > > introduction of that field should be deferred until it's needed for > > > > > something. > > > > > > > > Oops, looks like a mistake when breaking the patch. Will move that bit to 0002. > > > > > > Fixed that and also noticed that I had defined PartitionPruneResult in > > > the wrong header (execnodes.h). That led to PartitionPruneResult > > > nodes not being able to be written and read, because > > > src/backend/nodes/gen_node_support.pl doesn't create _out* and _read* > > > routines for the nodes defined in execnodes.h. I moved its definition > > > to plannodes.h, even though it is not actually the planner that > > > instantiates those; no other include/nodes header sounds better. > > > > > > One more thing I realized is that Bitmapsets added to the List > > > PartitionPruneResult.valid_subplan_offs_list are not actually > > > read/write-able. That's a problem that I also faced in [1], so I > > > proposed a patch there to make Bitmapset a read/write-able Node and > > > mark (only) the Bitmapsets that are added into read/write-able node > > > trees with the corresponding NodeTag. I'm including that patch here > > > as well (0002) for the main patch to work (pass > > > -DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make sense > > > to discuss it in its own thread? > > > > Had second thoughts on the use of List of Bitmapsets for this, such > > that the make-Bitmapset-Nodes patch is no longer needed. > > > > I had defined PartitionPruneResult such that it stood for the results > > of pruning for all PartitionPruneInfos contained in > > PlannedStmt.partPruneInfos (covering all Append/MergeAppend nodes that > > can use partition pruning in a given plan). So, it had a List of > > Bitmapset. I think it's perhaps better for PartitionPruneResult to > > cover only one PartitionPruneInfo and thus need only a Bitmapset and > > not a List thereof, which I have implemented in the attached updated > > patch 0002. So, instead of needing to pass around a > > PartitionPruneResult with each PlannedStmt, this now passes a List of > > PartitionPruneResult with an entry for each in > > PlannedStmt.partPruneInfos. > > Rebased over 3b2db22fe. Updated 0002 to cope with AssertArg() being removed from the tree. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Looking at 0001, I wonder if we should have a crosscheck that a PartitionPruneInfo you got from following an index is indeed constructed for the relation that you think it is: previously, you were always sure that the prune struct is for this node, because you followed a pointer that was set up in the node itself. Now you only have an index, and you have to trust that the index is correct. I'm not sure how to implement this, or even if it's doable at all. Keeping the OID of the partitioned table in the PartitionPruneInfo struct is easy, but I don't know how to check it in ExecInitMergeAppend and ExecInitAppend. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Find a bug in a program, and fix it, and the program will work today. Show the program how to find and fix a bug, and the program will work forever" (Oliver Silfridge)
Hi Alvaro, Thanks for looking at this one. On Thu, Dec 1, 2022 at 3:12 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Looking at 0001, I wonder if we should have a crosscheck that a > PartitionPruneInfo you got from following an index is indeed constructed > for the relation that you think it is: previously, you were always sure > that the prune struct is for this node, because you followed a pointer > that was set up in the node itself. Now you only have an index, and you > have to trust that the index is correct. Yeah, a crosscheck sounds like a good idea. > I'm not sure how to implement this, or even if it's doable at all. > Keeping the OID of the partitioned table in the PartitionPruneInfo > struct is easy, but I don't know how to check it in ExecInitMergeAppend > and ExecInitAppend. Hmm, how about keeping the [Merge]Append's parent relation's RT index in the PartitionPruneInfo and passing it down to ExecInitPartitionPruning() from ExecInit[Merge]Append() for cross-checking? Both Append and MergeAppend already have a 'apprelids' field that we can save a copy of in the PartitionPruneInfo. Tried that in the attached delta patch. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
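A minimal sketch of the cross-check being described, assuming the
PartitionPruneInfo carries a copy of the owning node's 'apprelids' (the
helper name and error wording are illustrative):

    #include "postgres.h"
    #include "nodes/bitmapset.h"

    /*
     * Called from ExecInit[Merge]Append(): the PartitionPruneInfo fetched
     * by index must have been built for this very Append/MergeAppend node.
     */
    static void
    check_pruneinfo_matches_node(const Bitmapset *node_apprelids,
                                 const Bitmapset *pruneinfo_relids,
                                 int part_prune_index)
    {
        if (!bms_equal(node_apprelids, pruneinfo_relids))
            elog(ERROR, "wrong pruneinfo with index %d passed to node",
                 part_prune_index);
    }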
On 2022-Dec-01, Amit Langote wrote:
> Hmm, how about keeping the [Merge]Append's parent relation's RT index
> in the PartitionPruneInfo and passing it down to
> ExecInitPartitionPruning() from ExecInit[Merge]Append() for
> cross-checking? Both Append and MergeAppend already have a
> 'apprelids' field that we can save a copy of in the
> PartitionPruneInfo. Tried that in the attached delta patch.

Ah yeah, that sounds about like what I was thinking. I've merged that in
and pushed to github, which had a strange pg_upgrade failure on Windows
mentioning log files that were not captured by the CI tooling. So I
pushed another one trying to grab those files, in case it wasn't a
one-off failure. It's running now:
https://cirrus-ci.com/task/5857239638999040

If all goes well with this run, I'll get this 0001 pushed.

-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Investigación es lo que hago cuando no sé lo que estoy haciendo" (Wernher von Braun)
On Thu, Dec 1, 2022 at 8:21 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > On 2022-Dec-01, Amit Langote wrote: > > Hmm, how about keeping the [Merge]Append's parent relation's RT index > > in the PartitionPruneInfo and passing it down to > > ExecInitPartitionPruning() from ExecInit[Merge]Append() for > > cross-checking? Both Append and MergeAppend already have a > > 'apprelids' field that we can save a copy of in the > > PartitionPruneInfo. Tried that in the attached delta patch. > > Ah yeah, that sounds about what I was thinking. I've merged that in and > pushed to github, which had a strange pg_upgrade failure on Windows > mentioning log files that were not captured by the CI tooling. So I > pushed another one trying to grab those files, in case it wasn't an > one-off failure. It's running now: > https://cirrus-ci.com/task/5857239638999040 > > If all goes well with this run, I'll get this 0001 pushed. Thanks for pushing 0001. Rebased 0002 attached. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
On Thu, Dec 1, 2022 at 9:43 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Thu, Dec 1, 2022 at 8:21 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2022-Dec-01, Amit Langote wrote: > > > Hmm, how about keeping the [Merge]Append's parent relation's RT index > > > in the PartitionPruneInfo and passing it down to > > > ExecInitPartitionPruning() from ExecInit[Merge]Append() for > > > cross-checking? Both Append and MergeAppend already have a > > > 'apprelids' field that we can save a copy of in the > > > PartitionPruneInfo. Tried that in the attached delta patch. > > > > Ah yeah, that sounds about what I was thinking. I've merged that in and > > pushed to github, which had a strange pg_upgrade failure on Windows > > mentioning log files that were not captured by the CI tooling. So I > > pushed another one trying to grab those files, in case it wasn't an > > one-off failure. It's running now: > > https://cirrus-ci.com/task/5857239638999040 > > > > If all goes well with this run, I'll get this 0001 pushed. > > Thanks for pushing 0001. > > Rebased 0002 attached. Thought it might be good for PartitionPruneResult to also have root_parent_relids that matches with the corresponding PartitionPruneInfo. ExecInitPartitionPruning() does a sanity check that the root_parent_relids of a given pair of PartitionPrune{Info | Result} match. Posting the patch separately as the attached 0002, just in case you might think that the extra cross-checking would be an overkill. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
On Fri, Dec 2, 2022 at 7:40 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Thu, Dec 1, 2022 at 9:43 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Thu, Dec 1, 2022 at 8:21 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > On 2022-Dec-01, Amit Langote wrote: > > > > Hmm, how about keeping the [Merge]Append's parent relation's RT index > > > > in the PartitionPruneInfo and passing it down to > > > > ExecInitPartitionPruning() from ExecInit[Merge]Append() for > > > > cross-checking? Both Append and MergeAppend already have a > > > > 'apprelids' field that we can save a copy of in the > > > > PartitionPruneInfo. Tried that in the attached delta patch. > > > > > > Ah yeah, that sounds about what I was thinking. I've merged that in and > > > pushed to github, which had a strange pg_upgrade failure on Windows > > > mentioning log files that were not captured by the CI tooling. So I > > > pushed another one trying to grab those files, in case it wasn't an > > > one-off failure. It's running now: > > > https://cirrus-ci.com/task/5857239638999040 > > > > > > If all goes well with this run, I'll get this 0001 pushed. > > > > Thanks for pushing 0001. > > > > Rebased 0002 attached. > > Thought it might be good for PartitionPruneResult to also have > root_parent_relids that matches with the corresponding > PartitionPruneInfo. ExecInitPartitionPruning() does a sanity check > that the root_parent_relids of a given pair of PartitionPrune{Info | > Result} match. > > Posting the patch separately as the attached 0002, just in case you > might think that the extra cross-checking would be an overkill. Rebased over 92c4dafe1eed and fixed some factual mistakes in the comment above ExecutorDoInitialPruning(). -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
On Mon, Dec 5, 2022 at 12:00 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Fri, Dec 2, 2022 at 7:40 PM Amit Langote <amitlangote09@gmail.com> wrote: > > Thought it might be good for PartitionPruneResult to also have > > root_parent_relids that matches with the corresponding > > PartitionPruneInfo. ExecInitPartitionPruning() does a sanity check > > that the root_parent_relids of a given pair of PartitionPrune{Info | > > Result} match. > > > > Posting the patch separately as the attached 0002, just in case you > > might think that the extra cross-checking would be an overkill. > > Rebased over 92c4dafe1eed and fixed some factual mistakes in the > comment above ExecutorDoInitialPruning(). Sorry, I had forgotten to git-add hunks including some cosmetic changes in that one. Here's another version. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
I find the API of GetCachedPlans a little weird after this patch. I think it may be better to have it return a pointer of a new struct -- one that contains both the CachedPlan pointer and the list of pruning results. (As I understand, the sole caller that isn't interested in the pruning results, SPI_plan_get_cached_plan, can be explained by the fact that it knows there won't be any. So I don't think we need to worry about this case?) And I think you should make that struct also be the last argument of PortalDefineQuery, so you don't need the separate PortalStorePartitionPruneResults function -- because as far as I can tell, the callers that pass a non-NULL pointer there are the exactly same that later call PortalStorePartitionPruneResults. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "La primera ley de las demostraciones en vivo es: no trate de usar el sistema. Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)
Thanks for the review.

On Wed, Dec 7, 2022 at 4:00 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I find the API of GetCachedPlans a little weird after this patch. I
> think it may be better to have it return a pointer of a new struct --
> one that contains both the CachedPlan pointer and the list of pruning
> results. (As I understand, the sole caller that isn't interested in the
> pruning results, SPI_plan_get_cached_plan, can be explained by the fact
> that it knows there won't be any. So I don't think we need to worry
> about this case?)

David, in his Apr 7 reply on this thread, also seemed to suggest
something similar.

Hmm, I was / am not so sure if GetCachedPlan() should return something
that is not CachedPlan. An idea I had today was to replace the
part_prune_results_list output List parameter with, say,
QueryInitPruningResult, or something like that, and put the current list
into that struct. Was looking at QueryEnvironment to come up with
*that* name. Any thoughts?

> And I think you should make that struct also be the last argument of
> PortalDefineQuery, so you don't need the separate
> PortalStorePartitionPruneResults function -- because as far as I can
> tell, the callers that pass a non-NULL pointer there are the exactly
> same that later call PortalStorePartitionPruneResults.

Yes, it would be better to not need PortalStorePartitionPruneResults.

-- Thanks, Amit Langote EDB: http://www.enterprisedb.com
On 2022-Dec-09, Amit Langote wrote: > On Wed, Dec 7, 2022 at 4:00 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I find the API of GetCachedPlans a little weird after this patch. > David, in his Apr 7 reply on this thread, also sounded to suggest > something similar. > > Hmm, I was / am not so sure if GetCachedPlan() should return something > that is not CachedPlan. An idea I had today was to replace the > part_prune_results_list output List parameter with, say, > QueryInitPruningResult, or something like that and put the current > list into that struct. Was looking at QueryEnvironment to come up > with *that* name. Any thoughts? Remind me again why is part_prune_results_list not part of struct CachedPlan then? I tried to understand that based on comments upthread, but I was unable to find anything. (My first reaction to your above comment was "well, rename GetCachedPlan then, maybe to GetRunnablePlan", but then I'm wondering if CachedPlan is in any way a structure that must be "immutable" in the way parser output is. Looking at the comment at the top of plancache.c it appears to me that it isn't, but maybe I'm missing something.) -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "The Postgresql hackers have what I call a "NASA space shot" mentality. Quite refreshing in a world of "weekend drag racer" developers." (Scott Marlowe)
On Fri, Dec 9, 2022 at 6:52 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > On 2022-Dec-09, Amit Langote wrote: > > On Wed, Dec 7, 2022 at 4:00 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > I find the API of GetCachedPlans a little weird after this patch. > > > David, in his Apr 7 reply on this thread, also sounded to suggest > > something similar. > > > > Hmm, I was / am not so sure if GetCachedPlan() should return something > > that is not CachedPlan. An idea I had today was to replace the > > part_prune_results_list output List parameter with, say, > > QueryInitPruningResult, or something like that and put the current > > list into that struct. Was looking at QueryEnvironment to come up > > with *that* name. Any thoughts? > > Remind me again why is part_prune_results_list not part of struct > CachedPlan then? I tried to understand that based on comments upthread, > but I was unable to find anything. It used to be part of CachedPlan for a brief period of time (in patch v12 I posted in [1]), but David, in his reply to [1], said he wasn't so sure that it belonged there. > (My first reaction to your above comment was "well, rename GetCachedPlan > then, maybe to GetRunnablePlan", but then I'm wondering if CachedPlan is > in any way a structure that must be "immutable" in the way parser output > is. Looking at the comment at the top of plancache.c it appears to me > that it isn't, but maybe I'm missing something.) CachedPlan *is* supposed to be read-only per the comment above CachedPlanSource definition: * ...If we are using a generic * cached plan then it is meant to be re-used across multiple executions, so * callers must always treat CachedPlans as read-only. FYI, there was even an idea of putting a PartitionPruneResults for a given PlannedStmt into the PlannedStmt itself [2], but PlannedStmt is supposed to be read-only too [3]. Maybe we need some new overarching context when invoking plancache, if Portal can't already be it, whose struct can be passed to GetCachedPlan() to put the pruning results in? Perhaps, GetRunnablePlan() that you floated could be a wrapper for GetCachedPlan(), owning that new context. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/CA%2BHiwqH4qQ_YVROr7TY0jSCuGn0oHhH79_DswOdXWN5UnMCBtQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAApHDvp_DjVVkgSV24%2BUF7p_yKWeepgoo%2BW2SWLLhNmjwHTVYQ%40mail.gmail.com [3] https://www.postgresql.org/message-id/922566.1648784745%40sss.pgh.pa.us
On 2022-Dec-09, Amit Langote wrote: > On Fri, Dec 9, 2022 at 6:52 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Remind me again why is part_prune_results_list not part of struct > > CachedPlan then? I tried to understand that based on comments upthread, > > but I was unable to find anything. > > It used to be part of CachedPlan for a brief period of time (in patch > v12 I posted in [1]), but David, in his reply to [1], said he wasn't > so sure that it belonged there. I'm not sure I necessarily agree with that. I'll have a look at v12 to try and understand what was David so unhappy about. > > (My first reaction to your above comment was "well, rename GetCachedPlan > > then, maybe to GetRunnablePlan", but then I'm wondering if CachedPlan is > > in any way a structure that must be "immutable" in the way parser output > > is. Looking at the comment at the top of plancache.c it appears to me > > that it isn't, but maybe I'm missing something.) > > CachedPlan *is* supposed to be read-only per the comment above > CachedPlanSource definition: > > * ...If we are using a generic > * cached plan then it is meant to be re-used across multiple executions, so > * callers must always treat CachedPlans as read-only. I read that as implying that the part_prune_results_list must remain intact as long as no invalidations occur. Does part_prune_result_list really change as a result of something other than a sinval event? Keep in mind that if a sinval message that touches one of the relations in the plan arrives, then we'll discard it and generate it afresh. I don't see that the part_prune_results_list would change otherwise, but maybe I misunderstand? > FYI, there was even an idea of putting a PartitionPruneResults for a > given PlannedStmt into the PlannedStmt itself [2], but PlannedStmt is > supposed to be read-only too [3]. Hmm, I'm not familiar with PlannedStmt lifetime, but I'm definitely not betting that Tom is wrong about this. > Maybe we need some new overarching context when invoking plancache, if > Portal can't already be it, whose struct can be passed to > GetCachedPlan() to put the pruning results in? Perhaps, > GetRunnablePlan() that you floated could be a wrapper for > GetCachedPlan(), owning that new context. Perhaps that is a solution. I'm not sure. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Uno puede defenderse de los ataques; contra los elogios se esta indefenso"
On Fri, Dec 9, 2022 at 7:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > On 2022-Dec-09, Amit Langote wrote: > > On Fri, Dec 9, 2022 at 6:52 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > Remind me again why is part_prune_results_list not part of struct > > > CachedPlan then? I tried to understand that based on comments upthread, > > > but I was unable to find anything. > > > > > (My first reaction to your above comment was "well, rename GetCachedPlan > > > then, maybe to GetRunnablePlan", but then I'm wondering if CachedPlan is > > > in any way a structure that must be "immutable" in the way parser output > > > is. Looking at the comment at the top of plancache.c it appears to me > > > that it isn't, but maybe I'm missing something.) > > > > CachedPlan *is* supposed to be read-only per the comment above > > CachedPlanSource definition: > > > > * ...If we are using a generic > > * cached plan then it is meant to be re-used across multiple executions, so > > * callers must always treat CachedPlans as read-only. > > I read that as implying that the part_prune_results_list must remain > intact as long as no invalidations occur. Does part_prune_result_list > really change as a result of something other than a sinval event? > Keep in mind that if a sinval message that touches one of the relations > in the plan arrives, then we'll discard it and generate it afresh. I > don't see that the part_prune_results_list would change otherwise, but > maybe I misunderstand? Pruning will be done afresh on every fetch of a given cached plan when CheckCachedPlan() is called on it, so the part_prune_results_list part will be discarded and rebuilt as many times as the plan is executed. You'll find a description around CachedPlanSavePartitionPruneResults() that's in v12. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
On 2022-Dec-09, Amit Langote wrote: > Pruning will be done afresh on every fetch of a given cached plan when > CheckCachedPlan() is called on it, so the part_prune_results_list part > will be discarded and rebuilt as many times as the plan is executed. > You'll find a description around CachedPlanSavePartitionPruneResults() > that's in v12. I see. In that case, a separate container struct seems warranted. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Industry suffers from the managerial dogma that for the sake of stability and continuity, the company should be independent of the competence of individual employees." (E. Dijkstra)
On Fri, Dec 9, 2022 at 8:37 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Dec-09, Amit Langote wrote:
> > Pruning will be done afresh on every fetch of a given cached plan when
> > CheckCachedPlan() is called on it, so the part_prune_results_list part
> > will be discarded and rebuilt as many times as the plan is executed.
> > You'll find a description around CachedPlanSavePartitionPruneResults()
> > that's in v12.
>
> I see.
>
> In that case, a separate container struct seems warranted.

I thought about this today and played around with some container struct
ideas. Though, I started feeling like putting all the new logic being
added by this patch into plancache.c at the heart of GetCachedPlan() and
tweaking its API in kind of unintuitive ways may not have been such a
good idea to begin with. So I started thinking again about your
GetRunnablePlan() wrapper idea and thought maybe we could do something
with it.

Let's say we name it GetCachedPlanLockPartitions() and put the logic
that does initial pruning with the new ExecutorDoInitialPruning() in it,
instead of in the normal GetCachedPlan() path. Any callers that call
GetCachedPlan() instead call GetCachedPlanLockPartitions() with either
the List ** parameter as now or some container struct if that seems
better. Whether GetCachedPlanLockPartitions() needs to do anything other
than return the CachedPlan returned by GetCachedPlan() can be decided by
the latter setting, say, CachedPlan.has_unlocked_partitions. That will
be done by AcquireExecutorLocks() when it sees containsInitialPruning in
any of the PlannedStmts it sees, locking only the
PlannedStmt.minLockRelids set (which is all relations where no pruning
is needed!), leaving the partition locking to
GetCachedPlanLockPartitions(). If the CachedPlan is invalidated during
the partition locking phase, it calls GetCachedPlan() again; maybe some
refactoring is needed to avoid too much useless work in such cases.

Thoughts?

-- Thanks, Amit Langote EDB: http://www.enterprisedb.com
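A control-flow sketch of that wrapper, under the assumptions stated
above: CachedPlan.has_unlocked_partitions is the proposed new flag, and
CachedPlanLockPartitions() is a hypothetical helper that runs "initial"
pruning, locks the surviving partitions, and reports whether the plan
stayed valid while doing so:

    #include "postgres.h"
    #include "utils/plancache.h"

    /* Hypothetical helper, per the description above. */
    extern bool CachedPlanLockPartitions(CachedPlanSource *plansource,
                                         ParamListInfo boundParams,
                                         ResourceOwner owner,
                                         List **part_prune_results);

    CachedPlan *
    GetCachedPlanLockPartitions(CachedPlanSource *plansource,
                                ParamListInfo boundParams,
                                ResourceOwner owner,
                                QueryEnvironment *queryEnv,
                                List **part_prune_results)
    {
        for (;;)
        {
            CachedPlan *plan = GetCachedPlan(plansource, boundParams,
                                             owner, queryEnv);

            /* has_unlocked_partitions is the proposed new CachedPlan flag */
            if (!plan->has_unlocked_partitions ||
                CachedPlanLockPartitions(plansource, boundParams, owner,
                                         part_prune_results))
                return plan;

            /* Locking a partition invalidated the plan; replan and retry. */
            ReleaseCachedPlan(plan, owner);
        }
    }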
On 2022-Dec-12, Amit Langote wrote:
> I started feeling like putting all the new logic being added
> by this patch into plancache.c at the heart of GetCachedPlan() and
> tweaking its API in kind of unintuitive ways may not have been such a
> good idea to begin with. So I started thinking again about your
> GetRunnablePlan() wrapper idea and thought maybe we could do something
> with it. Let's say we name it GetCachedPlanLockPartitions() and put
> the logic that does initial pruning with the new
> ExecutorDoInitialPruning() in it, instead of in the normal
> GetCachedPlan() path. Any callers that call GetCachedPlan() instead
> call GetCachedPlanLockPartitions() with either the List ** parameter
> as now or some container struct if that seems better. Whether
> GetCachedPlanLockPartitions() needs to do anything other than return
> the CachedPlan returned by GetCachedPlan() can be decided by the
> latter setting, say, CachedPlan.has_unlocked_partitions. That will be
> done by AcquireExecutorLocks() when it sees containsInitialPruning in
> any of the PlannedStmts it sees, locking only the
> PlannedStmt.minLockRelids set (which is all relations where no pruning
> is needed!), leaving the partition locking to
> GetCachedPlanLockPartitions().

Hmm. This doesn't sound totally unreasonable, except to the point David
was making that perhaps we may want this container struct to accommodate
other things in the future than just the partition pruning results, so I
think its name (and that of the function that produces it) ought to be a
little more generic than that.

(I think this also answers your question on whether a List ** is better
than a container struct.)

-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Las cosas son buenas o malas segun las hace nuestra opinión" (Lisias)
On Tue, Dec 13, 2022 at 2:24 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > On 2022-Dec-12, Amit Langote wrote: > > I started feeling like putting all the new logic being added > > by this patch into plancache.c at the heart of GetCachedPlan() and > > tweaking its API in kind of unintuitive ways may not have been such a > > good idea to begin with. So I started thinking again about your > > GetRunnablePlan() wrapper idea and thought maybe we could do something > > with it. Let's say we name it GetCachedPlanLockPartitions() and put > > the logic that does initial pruning with the new > > ExecutorDoInitialPruning() in it, instead of in the normal > > GetCachedPlan() path. Any callers that call GetCachedPlan() instead > > call GetCachedPlanLockPartitions() with either the List ** parameter > > as now or some container struct if that seems better. Whether > > GetCachedPlanLockPartitions() needs to do anything other than return > > the CachedPlan returned by GetCachedPlan() can be decided by the > > latter setting, say, CachedPlan.has_unlocked_partitions. That will be > > done by AcquireExecutorLocks() when it sees containsInitialPrunnig in > > any of the PlannedStmts it sees, locking only the > > PlannedStmt.minLockRelids set (which is all relations where no pruning > > is needed!), leaving the partition locking to > > GetCachedPlanLockPartitions(). > > Hmm. This doesn't sound totally unreasonable, except to the point David > was making that perhaps we may want this container struct to accomodate > other things in the future than just the partition pruning results, so I > think its name (and that of the function that produces it) ought to be a > little more generic than that. > > (I think this also answers your question on whether a List ** is better > than a container struct.) OK, so here's a WIP attempt at that. I have moved the original functionality of GetCachedPlan() to GetCachedPlanInternal(), turning the former into a sort of controller as described shortly. The latter's CheckCachedPlan() part now only locks the "minimal" set of, non-prunable, relations, making a note of whether the plan contains any prunable subnodes and thus prunable relations whose locking is deferred to the caller, GetCachedPlan(). GetCachedPlan(), as a sort of controller as mentioned before, does the pruning if needed on the minimally valid plan returned by GetCachedPlanInternal(), locks the partitions that survive, and redoes the whole thing if the locking of partitions invalidates the plan. The pruning results are returned through the new output parameter of GetCachedPlan() of type CachedPlanExtra. I named it so after much consideration, because all the new logic that produces stuff to put into it is a part of the plancache module and has to do with manipulating a CachedPlan. (I had considered CachedPlanExecInfo to indicate that it contains information that is to be forwarded to the executor, though that just didn't seem to fit in plancache.h.) I have broken out a few things into a preparatory patch 0001. Mainly, it invents PlannedStmt.minLockRelids to replace the AcquireExecutorLocks()'s current loop over the range table to figure out the relations to lock. I also threw in a couple of pruning related non-functional changes in there to make it easier to read the 0002, which is the main patch. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
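As a rough sketch, the output-parameter struct described above might
look like the following; only the pruning-result list is shown, and the
actual layout in the posted patch may differ:

    #include "nodes/pg_list.h"

    /*
     * Execution-time information produced while validating a CachedPlan,
     * returned through a new output parameter of GetCachedPlan().
     */
    typedef struct CachedPlanExtra
    {
        List   *part_prune_results;     /* one PartitionPruneResult per
                                         * entry in PlannedStmt.partPruneInfos */
    } CachedPlanExtra;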
On Wed, Dec 14, 2022 at 5:35 PM Amit Langote <amitlangote09@gmail.com> wrote:
> I have moved the original functionality of GetCachedPlan() to
> GetCachedPlanInternal(), turning the former into a sort of controller
> as described shortly. The latter's CheckCachedPlan() part now only
> locks the "minimal" set of, non-prunable, relations, making a note of
> whether the plan contains any prunable subnodes and thus prunable
> relations whose locking is deferred to the caller, GetCachedPlan().
> GetCachedPlan(), as a sort of controller as mentioned before, does the
> pruning if needed on the minimally valid plan returned by
> GetCachedPlanInternal(), locks the partitions that survive, and redoes
> the whole thing if the locking of partitions invalidates the plan.

After sleeping on it, I realized this doesn't have to be that
complicated. Rather than turn GetCachedPlan() into a wrapper for
handling deferred partition locking as outlined above, I could have
changed it more simply as follows to get the same thing done:

 	if (!customplan)
 	{
-		if (CheckCachedPlan(plansource))
+		bool		hasUnlockedParts = false;
+
+		if (CheckCachedPlan(plansource, &hasUnlockedParts) &&
+			hasUnlockedParts &&
+			CachedPlanLockPartitions(plansource, boundParams, owner, extra))
 		{
 			/* We want a generic plan, and we already have a valid one */
 			plan = plansource->gplan;

Attached updated patch does it like that.

-- Thanks, Amit Langote EDB: http://www.enterprisedb.com
This version of the patch looks not entirely unreasonable to me. I'll set this as Ready for Committer in case David or Tom or someone else want to have a look and potentially commit it. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
On Wed, Dec 21, 2022 at 7:18 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > This version of the patch looks not entirely unreasonable to me. I'll > set this as Ready for Committer in case David or Tom or someone else > want to have a look and potentially commit it. Thank you, Alvaro. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > This version of the patch looks not entirely unreasonable to me. I'll > set this as Ready for Committer in case David or Tom or someone else > want to have a look and potentially commit it. I will have a look during the January CF. regards, tom lane
I spent some time re-reading this whole thread, and the more I read the
less happy I got. We are adding a lot of complexity and introducing
coding hazards that will surely bite somebody someday. And after a while
I had what felt like an epiphany: the whole problem arises because the
system is wrongly factored.

We should get rid of AcquireExecutorLocks altogether, allowing the
plancache to hand back a generic plan that it's not certain of the
validity of, and instead integrate the responsibility for acquiring
locks into executor startup. It'd have to be optional there, since we
don't need new locks in the case of executing a just-planned plan; but
we can easily add another eflags bit (EXEC_FLAG_GET_LOCKS or so). Then
there has to be a convention whereby the ExecInitNode traversal can
return an indicator that "we failed because the plan is stale, please
make a new plan".

There are a couple of reasons why this feels like a good idea:

* There's no need to worry about keeping the locking decisions in sync
with what executor startup does.

* We don't need to add the overhead proposed in the current patch to
pass forward data about what got locked/pruned. While that overhead is
hopefully less expensive than the locks it saved acquiring, it's still
overhead (and in some cases the patch will fail to save acquiring any
locks, making it certainly a net negative).

* In a successfully built execution state tree, there will simply not be
any nodes corresponding to pruned-away, never-locked subplans. As long
as code like EXPLAIN follows the state tree and doesn't poke into plan
nodes that have no matching state, it's secure against the sort of
problems that Robert worried about upthread.

While I've not attempted to write any code for this, I can also think of
a few issues that'd have to be resolved:

* We'd be pushing the responsibility for looping back and re-planning
out to fairly high-level calling code. There are only half a dozen
callers of GetCachedPlan, so there's not that many places to be touched;
but in some of those places the subsequent executor-start call is not
close by, so that the necessary refactoring might be pretty painful. I
doubt there's anything insurmountable, but we'd definitely be changing
some fundamental APIs.

* In some cases (views, at least) we need to acquire lock on relations
that aren't directly reflected anywhere in the plan tree. So there'd
have to be a separate mechanism for getting those locks and rechecking
validity afterward. A list of relevant relation OIDs might be enough for
that.

* We currently do ExecCheckPermissions() before initializing the plan
state tree. It won't do to check permissions on relations we haven't yet
locked, so that responsibility would have to be moved. Maybe that could
also be integrated into the initialization recursion? Not sure.

* In the existing usage of AcquireExecutorLocks, if we do decide that
the plan is stale then we are able to release all the locks we got
before we go off and replan. I'm not certain if that behavior needs to
be preserved, but if it does then that would require some additional
bookkeeping in the executor.

* This approach is optimizing on the assumption that we usually won't
need to replan, because if we do then we might waste a fair amount of
executor startup overhead before discovering we have to throw all that
state away. I think that's clearly the right way to bet, but perhaps
somebody else has a different view.

Thoughts?

regards, tom lane
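To visualize the refactoring being proposed, the caller-side control
flow might look roughly like the sketch below. EXEC_FLAG_GET_LOCKS, the
bool-returning ExecutorStart() variant, and the QueryDesc-building
helper are all stand-ins invented for this illustration, not existing
APIs:

    #include "postgres.h"
    #include "executor/execdesc.h"
    #include "executor/executor.h"
    #include "utils/plancache.h"

    #define EXEC_FLAG_GET_LOCKS		0x0400	/* proposed bit; value illustrative */

    /* Hypothetical stand-ins for existing setup/startup code. */
    extern QueryDesc *BuildQueryDescForCachedPlan(CachedPlan *plan,
                                                  ParamListInfo params);
    extern bool ExecutorStartCheckingValidity(QueryDesc *queryDesc, int eflags);

    /*
     * Under the proposal, executor startup acquires the locks and reports
     * whether the cached plan turned out to be stale; if so, release it
     * and plan again.
     */
    QueryDesc *
    StartCachedPlan(CachedPlanSource *plansource, ParamListInfo params,
                    ResourceOwner owner, QueryEnvironment *queryEnv)
    {
        for (;;)
        {
            CachedPlan *plan = GetCachedPlan(plansource, params, owner, queryEnv);
            QueryDesc  *queryDesc = BuildQueryDescForCachedPlan(plan, params);

            if (ExecutorStartCheckingValidity(queryDesc, EXEC_FLAG_GET_LOCKS))
                return queryDesc;       /* locks taken; plan still valid */

            /* Plan went stale while locking; clean up and replan. */
            FreeQueryDesc(queryDesc);
            ReleaseCachedPlan(plan, owner);
        }
    }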
On Fri, Jan 20, 2023 at 4:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I spent some time re-reading this whole thread, and the more I read > the less happy I got. Thanks a lot for your time on this. > We are adding a lot of complexity and introducing > coding hazards that will surely bite somebody someday. And after awhile > I had what felt like an epiphany: the whole problem arises because the > system is wrongly factored. We should get rid of AcquireExecutorLocks > altogether, allowing the plancache to hand back a generic plan that > it's not certain of the validity of, and instead integrate the > responsibility for acquiring locks into executor startup. It'd have > to be optional there, since we don't need new locks in the case of > executing a just-planned plan; but we can easily add another eflags > bit (EXEC_FLAG_GET_LOCKS or so). Then there has to be a convention > whereby the ExecInitNode traversal can return an indicator that > "we failed because the plan is stale, please make a new plan". Interesting. The current implementation relies on PlanCacheRelCallback() marking a generic CachedPlan as invalid, so perhaps there will have to be some sharing of state between the plancache and the executor for this to work? > There are a couple reasons why this feels like a good idea: > > * There's no need for worry about keeping the locking decisions in sync > with what executor startup does. > > * We don't need to add the overhead proposed in the current patch to > pass forward data about what got locked/pruned. While that overhead > is hopefully less expensive than the locks it saved acquiring, it's > still overhead (and in some cases the patch will fail to save acquiring > any locks, making it certainly a net negative). > > * In a successfully built execution state tree, there will simply > not be any nodes corresponding to pruned-away, never-locked subplans. > As long as code like EXPLAIN follows the state tree and doesn't poke > into plan nodes that have no matching state, it's secure against the > sort of problems that Robert worried about upthread. I think this is true with the patch as proposed too, but I was still a bit worried about what an ExecutorStart_hook may be doing with an uninitialized plan tree. Maybe we're mandating that the hook must call standard_ExecutorStart() and only work with the finished PlanState tree? > While I've not attempted to write any code for this, I can also > think of a few issues that'd have to be resolved: > > * We'd be pushing the responsibility for looping back and re-planning > out to fairly high-level calling code. There are only half a dozen > callers of GetCachedPlan, so there's not that many places to be > touched; but in some of those places the subsequent executor-start call > is not close by, so that the necessary refactoring might be pretty > painful. I doubt there's anything insurmountable, but we'd definitely > be changing some fundamental APIs. Yeah. I suppose mostly the same place that the current patch is touching to pass around the PartitionPruneResult nodes. > * In some cases (views, at least) we need to acquire lock on relations > that aren't directly reflected anywhere in the plan tree. So there'd > have to be a separate mechanism for getting those locks and rechecking > validity afterward. A list of relevant relation OIDs might be enough > for that. Hmm, a list of only the OIDs wouldn't preserve the lock mode, so maybe a list or bitmapset of the RTIs, something along the lines of PlannedStmt.minLockRelids in the patch? 
It perhaps even makes sense to make a special list in PlannedStmt for only the views? > * We currently do ExecCheckPermissions() before initializing the > plan state tree. It won't do to check permissions on relations we > haven't yet locked, so that responsibility would have to be moved. > Maybe that could also be integrated into the initialization recursion? > Not sure. Ah, I remember mentioning moving that into ExecGetRangeTableRelation() [1], but I guess that misses relations that are not referenced in the plan tree, such as views. Though maybe that's not a problem if we track views separately as mentioned above. > * In the existing usage of AcquireExecutorLocks, if we do decide > that the plan is stale then we are able to release all the locks > we got before we go off and replan. I'm not certain if that behavior > needs to be preserved, but if it does then that would require some > additional bookkeeping in the executor. I think maybe we'll want to continue to release the existing locks, because if we don't, it's possible we may keep some locks uselessly if replanning might lock a different set of relations. > * This approach is optimizing on the assumption that we usually > won't need to replan, because if we do then we might waste a fair > amount of executor startup overhead before discovering we have > to throw all that state away. I think that's clearly the right > way to bet, but perhaps somebody else has a different view. Not sure if you'd like, because it would still keep the PartitionPruneResult business, but this will be less of a problem if we do the initial pruning at the beginning of InitPlan(), followed by locking, before doing anything else. We would have initialized the QueryDesc and the EState, but only minimally. That also keeps the PartitionPruneResult business local to the executor. Would you like me to hack up a PoC or are you already on that? -- Thanks, Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/CA%2BHiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA%40mail.gmail.com
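As a small illustration of why RT indexes rather than bare OIDs are
attractive here, the separate locking step could then reuse the lock
mode already recorded in each RTE. The list argument and function name
below are hypothetical at this point (a similar PlannedStmt.viewRelations
list is proposed later in the thread):

    #include "postgres.h"
    #include "nodes/plannodes.h"
    #include "parser/parsetree.h"
    #include "storage/lmgr.h"

    /*
     * Lock relations (e.g. views) that appear in the range table but not
     * in the plan tree, identified by RT index so that rellockmode is
     * preserved.
     */
    static void
    LockUnplannedRelations(PlannedStmt *stmt, List *view_rtis)
    {
        ListCell   *lc;

        foreach(lc, view_rtis)
        {
            RangeTblEntry *rte = rt_fetch(lfirst_int(lc), stmt->rtable);

            LockRelationOid(rte->relid, rte->rellockmode);
        }
    }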
Amit Langote <amitlangote09@gmail.com> writes: > On Fri, Jan 20, 2023 at 4:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I had what felt like an epiphany: the whole problem arises because the >> system is wrongly factored. We should get rid of AcquireExecutorLocks >> altogether, allowing the plancache to hand back a generic plan that >> it's not certain of the validity of, and instead integrate the >> responsibility for acquiring locks into executor startup. > Interesting. The current implementation relies on > PlanCacheRelCallback() marking a generic CachedPlan as invalid, so > perhaps there will have to be some sharing of state between the > plancache and the executor for this to work? Yeah. Thinking a little harder, I think this would have to involve passing a CachedPlan pointer to the executor, and what the executor would do after acquiring each lock is to ask the plancache "hey, do you still think this CachedPlan entry is valid?". In the case where there's a problem, the AcceptInvalidationMessages call involved in lock acquisition would lead to a cache inval that clears the validity flag on the CachedPlan entry, and this would provide an inexpensive way to check if that happened. It might be possible to incorporate this pointer into PlannedStmt instead of passing it separately. >> * In a successfully built execution state tree, there will simply >> not be any nodes corresponding to pruned-away, never-locked subplans. > I think this is true with the patch as proposed too, but I was still a > bit worried about what an ExecutorStart_hook may be doing with an > uninitialized plan tree. Maybe we're mandating that the hook must > call standard_ExecutorStart() and only work with the finished > PlanState tree? It would certainly be incumbent on any such hook to not touch not-yet-locked parts of the plan tree. I'm not particularly concerned about that sort of requirements change, because we'd be breaking APIs all through this area in any case. >> * In some cases (views, at least) we need to acquire lock on relations >> that aren't directly reflected anywhere in the plan tree. So there'd >> have to be a separate mechanism for getting those locks and rechecking >> validity afterward. A list of relevant relation OIDs might be enough >> for that. > Hmm, a list of only the OIDs wouldn't preserve the lock mode, Good point. I wonder if we could integrate this with the RTEPermissionInfo data structure? > Would you like me to hack up a PoC or are you already on that? I'm not planning to work on this myself, I was hoping you would. regards, tom lane
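The "inexpensive way to check" could be as simple as re-reading the
validity flag that PlanCacheRelCallback() already clears on
invalidation; the wrapper function name here is hypothetical:

    #include "postgres.h"
    #include "utils/plancache.h"

    /*
     * After each lock acquisition (and the AcceptInvalidationMessages()
     * it implies), ask whether the CachedPlan is still considered valid.
     */
    static inline bool
    CachedPlanStillValid(const CachedPlan *cplan)
    {
        return cplan->is_valid;
    }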
On Fri, Jan 20, 2023 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <amitlangote09@gmail.com> writes: > > On Fri, Jan 20, 2023 at 4:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I had what felt like an epiphany: the whole problem arises because the > >> system is wrongly factored. We should get rid of AcquireExecutorLocks > >> altogether, allowing the plancache to hand back a generic plan that > >> it's not certain of the validity of, and instead integrate the > >> responsibility for acquiring locks into executor startup. > > > Interesting. The current implementation relies on > > PlanCacheRelCallback() marking a generic CachedPlan as invalid, so > > perhaps there will have to be some sharing of state between the > > plancache and the executor for this to work? > > Yeah. Thinking a little harder, I think this would have to involve > passing a CachedPlan pointer to the executor, and what the executor > would do after acquiring each lock is to ask the plancache "hey, do > you still think this CachedPlan entry is valid?". In the case where > there's a problem, the AcceptInvalidationMessages call involved in > lock acquisition would lead to a cache inval that clears the validity > flag on the CachedPlan entry, and this would provide an inexpensive > way to check if that happened. OK, thanks, this is useful. > It might be possible to incorporate this pointer into PlannedStmt > instead of passing it separately. Yeah, that would be less churn. Though, I wonder if you still hold that PlannedStmt should not be scribbled upon outside the planner as you said upthread [1]? > >> * In a successfully built execution state tree, there will simply > >> not be any nodes corresponding to pruned-away, never-locked subplans. > > > I think this is true with the patch as proposed too, but I was still a > > bit worried about what an ExecutorStart_hook may be doing with an > > uninitialized plan tree. Maybe we're mandating that the hook must > > call standard_ExecutorStart() and only work with the finished > > PlanState tree? > > It would certainly be incumbent on any such hook to not touch > not-yet-locked parts of the plan tree. I'm not particularly concerned > about that sort of requirements change, because we'd be breaking APIs > all through this area in any case. OK. Perhaps something that should be documented around ExecutorStart(). > >> * In some cases (views, at least) we need to acquire lock on relations > >> that aren't directly reflected anywhere in the plan tree. So there'd > >> have to be a separate mechanism for getting those locks and rechecking > >> validity afterward. A list of relevant relation OIDs might be enough > >> for that. > > > Hmm, a list of only the OIDs wouldn't preserve the lock mode, > > Good point. I wonder if we could integrate this with the > RTEPermissionInfo data structure? You mean adding a rellockmode field to RTEPermissionInfo? > > Would you like me to hack up a PoC or are you already on that? > > I'm not planning to work on this myself, I was hoping you would. Alright, I'll try to get something out early next week. Thanks for all the pointers. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/922566.1648784745%40sss.pgh.pa.us
Amit Langote <amitlangote09@gmail.com> writes: > On Fri, Jan 20, 2023 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It might be possible to incorporate this pointer into PlannedStmt >> instead of passing it separately. > Yeah, that would be less churn. Though, I wonder if you still hold > that PlannedStmt should not be scribbled upon outside the planner as > you said upthread [1]? Well, the whole point of that rule is that the executor can't modify a plancache entry. If the plancache itself sets a field in such an entry, that doesn't seem problematic from here. But there's other possibilities if that bothers you; QueryDesc could hold the field, for example. Also, I bet we'd want to copy it into EState for the main initialization recursion. regards, tom lane
On Fri, Jan 20, 2023 at 12:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > On Fri, Jan 20, 2023 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> It might be possible to incorporate this pointer into PlannedStmt
> >> instead of passing it separately.
>
> > Yeah, that would be less churn. Though, I wonder if you still hold
> > that PlannedStmt should not be scribbled upon outside the planner as
> > you said upthread [1]?
>
> Well, the whole point of that rule is that the executor can't modify
> a plancache entry. If the plancache itself sets a field in such an
> entry, that doesn't seem problematic from here.
>
> But there's other possibilities if that bothers you; QueryDesc
> could hold the field, for example. Also, I bet we'd want to copy
> it into EState for the main initialization recursion.

QueryDesc sounds good to me, and yes, also a copy in EState in any case.

So I started looking at the call sites of CreateQueryDesc() and stopped
to look at ExecParallelGetQueryDesc(). AFAICS, we wouldn't need to pass
the CachedPlan to a parallel worker's rerun of InitPlan(), because 1) it
doesn't make sense to call the plancache in a parallel worker, and 2)
the leader should already have taken all the locks necessary for
executing a given plan subnode that it intends to pass to a worker in
ExecInitGather(). Does that make sense?

-- Thanks, Amit Langote EDB: http://www.enterprisedb.com
On Fri, Jan 20, 2023 at 12:52 PM Amit Langote <amitlangote09@gmail.com> wrote: > Alright, I'll try to get something out early next week. Thanks for > all the pointers. Sorry for the delay. Attached is what I've come up with so far. I didn't actually go with calling the plancache on every lock taken on a relation, that is, in ExecGetRangeTableRelation(). One thing about doing it that way that I didn't quite like (or didn't see a clean enough way to code) is the need to complicate the ExecInitNode() traversal for handling the abrupt suspension of the ongoing setup of the PlanState tree. So, I decided to keep the current model of locking all the relations that need to be locked before doing anything else in InitPlan(), much as how AcquireExecutorLocks() does it. A new function called from the top of InitPlan that I've called ExecLockRelationsIfNeeded() does that locking after performing the initial pruning in the same manner as the earlier patch did. That does mean that I needed to keep all the adjustments of the pruning code that are required for such out-of-ExecInitNode() invocation of initial pruning, including those PartitionPruneResult to carry the result of that pruning for ExecInitNode()-time reuse, though they no longer need be passed through many unrelated interfaces. Anyways, here's a description of the patches: 0001 adjusts various call sites of ExecutorStart() to cope with the possibility of being asked to recreate a CachedPlan, if one is involved. The main objective here is to have as little stuff as sensible happen between GetCachedPlan() that returned the CachedPlan and ExecutorStart() so as to minimize the chances of missing cleaning up resources that must not be missed. 0002 is preparatory refactoring to make out-of-ExecInitNode() invocation of pruning possible. 0003 moves the responsibility of CachedPlan validation locking into ExecutorStart() as described above. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Fri, Jan 27, 2023 at 4:01 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Fri, Jan 20, 2023 at 12:52 PM Amit Langote <amitlangote09@gmail.com> wrote: > > Alright, I'll try to get something out early next week. Thanks for > > all the pointers. > > Sorry for the delay. Attached is what I've come up with so far. > > I didn't actually go with calling the plancache on every lock taken on > a relation, that is, in ExecGetRangeTableRelation(). One thing about > doing it that way that I didn't quite like (or didn't see a clean > enough way to code) is the need to complicate the ExecInitNode() > traversal for handling the abrupt suspension of the ongoing setup of > the PlanState tree. OK, I gave this one more try and attached is what I came up with. This adds an ExecPlanStillValid(), which is called right after anything that may in turn call ExecGetRangeTableRelation(), which has been taught to lock a relation if EXEC_FLAG_GET_LOCKS has been passed in EState.es_top_eflags. That includes all ExecInitNode() calls, and a few other functions that call ExecGetRangeTableRelation() directly, such as ExecOpenScanRelation(). If ExecPlanStillValid() returns false, that is, if EState.es_cachedplan is found to have been invalidated after a lock was taken by ExecGetRangeTableRelation(), whatever function called it must return immediately, and so must its caller and so on. ExecEndPlan() seems to be able to clean up after a partially finished attempt at initializing a PlanState tree in this way. Maybe my preliminary testing didn't catch cases where pointers to resources that are normally put into the nodes of a PlanState tree are now left dangling, because a partially built PlanState tree is not accessible to ExecEndPlan(); QueryDesc.planstate would remain NULL in such cases. Maybe it's only es_tupleTable and es_relations that need to be explicitly released and the rest is taken care of by resetting the ExecutorState context. On testing, I'm afraid we're going to need something like src/test/modules/delay_execution to test that concurrent changes to relation(s) in PlannedStmt.relationOids that occur somewhere between RevalidateCachedQuery() and InitPlan() result in the latter being aborted and that this is handled correctly. It seems like it is only the locking of partitions (that are not present in an unplanned Query and thus not protected by AcquirePlannerLocks()) that can trigger replanning of a CachedPlan, so any tests we write should involve partitions. Should this try to test as many plan shapes as possible, though, given the uncertainty around ExecEndPlan() robustness, or should manual auditing suffice to be sure that nothing's broken? On possibly needing to move permission checking to occur *after* taking locks, I realized that we don't really need to, because no relation that needs its permissions checked should be unlocked by the time we get to ExecCheckPermissions(); note we only check permissions of tables that are present in the original parse tree and RevalidateCachedQuery() should have locked those. I found a couple of exceptions to that invariant in that views sometimes appear not to be in the set of relations that RevalidateCachedQuery() locks. So, I invented PlannedStmt.viewRelations, a list of RT indexes of view RTEs that is populated in setrefs.c. ExecLockViewRelations(), called before ExecCheckPermissions(), locks those. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
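For readers following the control flow described above, the validity test might look roughly like the following; ExecPlanStillValid() and the es_cachedplan field are additions the patch would make to the executor, so this is only an illustrative sketch, not the patch's actual code:

    #include "postgres.h"
    #include "nodes/execnodes.h"
    #include "utils/plancache.h"

    /*
     * Illustrative only: report whether the CachedPlan that the plan tree
     * being initialized came from is still valid.  Locking a child table can
     * process invalidation messages that clear CachedPlan.is_valid; plans
     * that did not come from the plancache (es_cachedplan == NULL) are
     * always considered valid.
     */
    static inline bool
    ExecPlanStillValid(EState *estate)
    {
        return estate->es_cachedplan == NULL ||
               estate->es_cachedplan->is_valid;
    }

Each ExecInitNode() subroutine, and the few direct callers of ExecGetRangeTableRelation(), would bail out as soon as such a check returns false.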
On Thu, Feb 2, 2023 at 11:49 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Fri, Jan 27, 2023 at 4:01 PM Amit Langote <amitlangote09@gmail.com> wrote: > > I didn't actually go with calling the plancache on every lock taken on > > a relation, that is, in ExecGetRangeTableRelation(). One thing about > > doing it that way that I didn't quite like (or didn't see a clean > > enough way to code) is the need to complicate the ExecInitNode() > > traversal for handling the abrupt suspension of the ongoing setup of > > the PlanState tree. > > OK, I gave this one more try and attached is what I came up with. > > This adds an ExecPlanStillValid(), which is called right after anything > that may in turn call ExecGetRangeTableRelation(), which has been > taught to lock a relation if EXEC_FLAG_GET_LOCKS has been passed in > EState.es_top_eflags. That includes all ExecInitNode() calls, and a > few other functions that call ExecGetRangeTableRelation() directly, > such as ExecOpenScanRelation(). If ExecPlanStillValid() returns > false, that is, if EState.es_cachedplan is found to have been > invalidated after a lock was taken by ExecGetRangeTableRelation(), > whatever function called it must return immediately, and so must its > caller and so on. ExecEndPlan() seems to be able to clean up after a > partially finished attempt at initializing a PlanState tree in this > way. Maybe my preliminary testing didn't catch cases where pointers > to resources that are normally put into the nodes of a PlanState tree > are now left dangling, because a partially built PlanState tree is not > accessible to ExecEndPlan(); QueryDesc.planstate would remain NULL in > such cases. Maybe it's only es_tupleTable and es_relations that > need to be explicitly released and the rest is taken care of by > resetting the ExecutorState context. In the attached updated patch, I've made the functions that check ExecPlanStillValid() return NULL (if they return anything) instead of returning partially initialized structs. Those partially initialized structs were not being subsequently looked at anyway. > On testing, I'm afraid we're going to need something like > src/test/modules/delay_execution to test that concurrent changes to > relation(s) in PlannedStmt.relationOids that occur somewhere between > RevalidateCachedQuery() and InitPlan() result in the latter being > aborted and that this is handled correctly. It seems like it is only > the locking of partitions (that are not present in an unplanned Query > and thus not protected by AcquirePlannerLocks()) that can trigger > replanning of a CachedPlan, so any tests we write should involve > partitions. Should this try to test as many plan shapes as possible, > though, given the uncertainty around ExecEndPlan() robustness, or should > manual auditing suffice to be sure that nothing's broken? I've added a test case under src/test/modules/delay_execution by adding a new ExecutorStart_hook that works similarly to delay_execution_planner(). The test works by allowing a concurrent session to drop an object being referenced in a cached plan that is being initialized, while the ExecutorStart_hook waits to get an advisory lock. The concurrent drop of the referenced object is detected during ExecInitNode() and thus triggers replanning of the cached plan. I also fixed a bug in ExplainExecuteQuery() while testing, and improved some comments. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
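To make the test mechanism described above a bit more concrete, a minimal sketch of such an ExecutorStart_hook might look like the following; the names delay_execution_lock_id and delay_execution_ExecutorStart are made up for illustration and need not match what the module actually uses:

    #include "postgres.h"
    #include "executor/executor.h"
    #include "fmgr.h"
    #include "utils/fmgrprotos.h"

    static int  delay_execution_lock_id = 0;    /* 0 means don't delay */
    static ExecutorStart_hook_type prev_ExecutorStart = NULL;

    /*
     * Stall before initializing the plan, so that a concurrent session gets
     * a chance to alter an object referenced by the cached plan about to be
     * initialized; the invalidation is then noticed during ExecInitNode().
     */
    static void
    delay_execution_ExecutorStart(QueryDesc *queryDesc, int eflags)
    {
        if (delay_execution_lock_id != 0)
        {
            /* Block until the concurrent session releases the lock. */
            DirectFunctionCall1(pg_advisory_lock_int8,
                                Int64GetDatum((int64) delay_execution_lock_id));
            DirectFunctionCall1(pg_advisory_unlock_int8,
                                Int64GetDatum((int64) delay_execution_lock_id));
        }

        if (prev_ExecutorStart)
            prev_ExecutorStart(queryDesc, eflags);
        else
            standard_ExecutorStart(queryDesc, eflags);
    }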
Hi, On 2023-02-03 22:01:09 +0900, Amit Langote wrote: > I've added a test case under src/modules/delay_execution by adding a > new ExecutorStart_hook that works similarly as > delay_execution_planner(). The test works by allowing a concurrent > session to drop an object being referenced in a cached plan being > initialized while the ExecutorStart_hook waits to get an advisory > lock. The concurrent drop of the referenced object is detected during > ExecInitNode() and thus triggers replanning of the cached plan. > > I also fixed a bug in the ExplainExecuteQuery() while testing and some comments. The tests seem to frequently hang on freebsd: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3478 Greetings, Andres Freund
On Tue, Feb 7, 2023 at 23:38 Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-02-03 22:01:09 +0900, Amit Langote wrote:
> I've added a test case under src/modules/delay_execution by adding a
> new ExecutorStart_hook that works similarly as
> delay_execution_planner(). The test works by allowing a concurrent
> session to drop an object being referenced in a cached plan being
> initialized while the ExecutorStart_hook waits to get an advisory
> lock. The concurrent drop of the referenced object is detected during
> ExecInitNode() and thus triggers replanning of the cached plan.
>
> I also fixed a bug in the ExplainExecuteQuery() while testing and some comments.
The tests seem to frequently hang on freebsd:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3478
Thanks for the heads up. I’ve noticed this one too, though I couldn’t find the testrun artifacts like I could for some other failures (on other cirrus machines). Has anyone else been in a similar situation?
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
On Wed, Feb 8, 2023 at 7:31 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Tue, Feb 7, 2023 at 23:38 Andres Freund <andres@anarazel.de> wrote: >> The tests seem to frequently hang on freebsd: >> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3478 > > Thanks for the heads up. I’ve noticed this one too, though couldn’t find the testrun artifacts like I could get for someother failures (on other cirrus machines). Has anyone else been a similar situation? I think I have figured out what might be going wrong on that cfbot animal after building with the same CPPFLAGS as that animal locally. I had forgotten to update _out/_readRangeTblEntry() to account for the patch's change that a view's RTE_SUBQUERY now also preserves relkind in addition to relid and rellockmode for the locking consideration. Also, I noticed that a multi-query Portal execution with rules was failing (thanks to a regression test added in a7d71c41db) because of the snapshot used for the 2nd query onward not being updated for command ID change under patched model of multi-query Portal execution. To wit, under the patched model, all queries in the multi-query Portal case undergo ExecutorStart() before any of it is run with ExecutorRun(). The patch hadn't changed things however to update the snapshot's command ID for the 2nd query onwards, which caused the aforementioned test case to fail. This new model does however mean that the 2nd query onwards must use PushCopiedSnapshot() given the current requirement of UpdateActiveSnapshotCommandId() that the snapshot passed to it must not be referenced anywhere else. The new model basically requires that each query's QueryDesc points to its own copy of the ActiveSnapshot. That may not be a thing in favor of the patched model though. For now, I haven't been able to come up with a better alternative. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
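As a rough illustration of the snapshot handling described above for multi-query portals, the per-statement startup loop in PortalStart() could be shaped something like this; it is only a sketch of the idea (None_Receiver stands in for the portal's real destination, and the bookkeeping of the created QueryDescs is elided), not the patch's code:

    ListCell   *lc;
    bool        first = true;

    foreach(lc, portal->stmts)
    {
        PlannedStmt *pstmt = lfirst_node(PlannedStmt, lc);
        QueryDesc  *qdesc;

        if (!first)
        {
            /*
             * Each statement after the first gets a bumped command ID and
             * its own copy of the active snapshot, because
             * UpdateActiveSnapshotCommandId() requires that the snapshot
             * not be referenced anywhere else.
             */
            CommandCounterIncrement();
            PushCopiedSnapshot(GetActiveSnapshot());
            UpdateActiveSnapshotCommandId();
        }

        qdesc = CreateQueryDesc(pstmt, portal->sourceText,
                                GetActiveSnapshot(), InvalidSnapshot,
                                None_Receiver, portal->portalParams,
                                portal->queryEnv, 0);
        ExecutorStart(qdesc, 0);
        /* remember qdesc for PortalRun() to pass to ExecutorRun() later */
        first = false;
    }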
On Thu, Mar 2, 2023 at 10:52 PM Amit Langote <amitlangote09@gmail.com> wrote: > I think I have figured out what might be going wrong on that cfbot > animal after building with the same CPPFLAGS as that animal locally. > I had forgotten to update _out/_readRangeTblEntry() to account for the > patch's change that a view's RTE_SUBQUERY now also preserves relkind > in addition to relid and rellockmode for the locking consideration. > > Also, I noticed that a multi-query Portal execution with rules was > failing (thanks to a regression test added in a7d71c41db) because of > the snapshot used for the 2nd query onward not being updated for > command ID change under patched model of multi-query Portal execution. > To wit, under the patched model, all queries in the multi-query Portal > case undergo ExecutorStart() before any of it is run with > ExecutorRun(). The patch hadn't changed things however to update the > snapshot's command ID for the 2nd query onwards, which caused the > aforementioned test case to fail. > > This new model does however mean that the 2nd query onwards must use > PushCopiedSnapshot() given the current requirement of > UpdateActiveSnapshotCommandId() that the snapshot passed to it must > not be referenced anywhere else. The new model basically requires > that each query's QueryDesc points to its own copy of the > ActiveSnapshot. That may not be a thing in favor of the patched model > though. For now, I haven't been able to come up with a better > alternative. Here's a new version addressing the following 2 points. * Like views, I realized that non-leaf relations of partition trees scanned by an Append/MergeAppend would need to be locked separately, because ExecInitNode() traversal of the plan tree would not account for them. That is, they are not opened using ExecGetRangeTableRelation() or ExecOpenScanRelation(). One exception is that some (if not all) of those non-leaf relations may be referenced in PartitionPruneInfo and so locked as part of initializing the corresponding PartitionPruneState, but I decided not to complicate the code to filter out such relations from the set locked separately. To carry the set of relations to lock, the refactoring patch 0001 re-introduces the List of Bitmapset field named allpartrelids into Append/MergeAppend nodes, which we had previously removed on the grounds that those relations need not be locked separately (commits f2343653f5b, f003a7522bf). * I decided to initialize QueryDesc.planstate even in the cases where ExecInitNode() traversal is aborted in the middle on detecting CachedPlan invalidation such that it points to a partially initialized PlanState tree. My earlier thinking that each PlanState node need not be visited for resource cleanup in such cases was naive after all. To that end, I've fixed the ExecEndNode() subroutines of all Plan node types to account for potentially uninitialized fields. There are a couple of cases where I'm a bit doubtful though. In ExecEndCustomScan(), there's no indication in CustomScanState whether it's OK to call EndCustomScan() when BeginCustomScan() may not have been called. For ForeignScanState, I've assumed that ForeignScanState.fdw_state being set can be used as a marker that BeginForeignScan would have been called, though maybe that's not a solid assumption. 
I'm also attaching a new (small) patch 0003 that eliminates the loop-over-rangetable in ExecCloseRangeTableRelations() in favor of iterating over a new List field of EState named es_opened_relations, which is populated by ExecGetRangeTableRelation() with only the relations that were opened. This speeds up ExecCloseRangeTableRelations() significantly for the cases with many runtime-prunable partitions. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
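For what it's worth, the 0003 idea mentioned above boils down to something like the following loop, with es_opened_relations being the new EState field; this is just a sketch of my reading of it, not the patch text:

    #include "postgres.h"
    #include "access/table.h"
    #include "nodes/execnodes.h"

    /*
     * Sketch: close only the relations that were actually opened during
     * this execution, instead of walking the entire range table.  Locks
     * are held until transaction end, hence NoLock.
     */
    void
    ExecCloseRangeTableRelations(EState *estate)
    {
        ListCell   *l;

        foreach(l, estate->es_opened_relations)
        {
            Relation    rel = (Relation) lfirst(l);

            table_close(rel, NoLock);
        }
    }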
On Tue, Mar 14, 2023 at 7:07 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Thu, Mar 2, 2023 at 10:52 PM Amit Langote <amitlangote09@gmail.com> wrote: > > I think I have figured out what might be going wrong on that cfbot > > animal after building with the same CPPFLAGS as that animal locally. > > I had forgotten to update _out/_readRangeTblEntry() to account for the > > patch's change that a view's RTE_SUBQUERY now also preserves relkind > > in addition to relid and rellockmode for the locking consideration. > > > > Also, I noticed that a multi-query Portal execution with rules was > > failing (thanks to a regression test added in a7d71c41db) because of > > the snapshot used for the 2nd query onward not being updated for > > command ID change under patched model of multi-query Portal execution. > > To wit, under the patched model, all queries in the multi-query Portal > > case undergo ExecutorStart() before any of it is run with > > ExecutorRun(). The patch hadn't changed things however to update the > > snapshot's command ID for the 2nd query onwards, which caused the > > aforementioned test case to fail. > > > > This new model does however mean that the 2nd query onwards must use > > PushCopiedSnapshot() given the current requirement of > > UpdateActiveSnapshotCommandId() that the snapshot passed to it must > > not be referenced anywhere else. The new model basically requires > > that each query's QueryDesc points to its own copy of the > > ActiveSnapshot. That may not be a thing in favor of the patched model > > though. For now, I haven't been able to come up with a better > > alternative. > > Here's a new version addressing the following 2 points. > > * Like views, I realized that non-leaf relations of partition trees > scanned by an Append/MergeAppend would need to be locked separately, > because ExecInitNode() traversal of the plan tree would not account > for them. That is, they are not opened using > ExecGetRangeTableRelation() or ExecOpenScanRelation(). One exception > is that some (if not all) of those non-leaf relations may be > referenced in PartitionPruneInfo and so locked as part of initializing > the corresponding PartitionPruneState, but I decided not to complicate > the code to filter out such relations from the set locked separately. > To carry the set of relations to lock, the refactoring patch 0001 > re-introduces the List of Bitmapset field named allpartrelids into > Append/MergeAppend nodes, which we had previously removed on the > grounds that those relations need not be locked separately (commits > f2343653f5b, f003a7522bf). > > * I decided to initialize QueryDesc.planstate even in the cases where > ExecInitNode() traversal is aborted in the middle on detecting > CachedPlan invalidation such that it points to a partially initialized > PlanState tree. My earlier thinking that each PlanState node need not > be visited for resource cleanup in such cases was naive after all. To > that end, I've fixed the ExecEndNode() subroutines of all Plan node > types to account for potentially uninitialized fields. There are a > couple of cases where I'm a bit doubtful though. In > ExecEndCustomScan(), there's no indication in CustomScanState whether > it's OK to call EndCustomScan() when BeginCustomScan() may not have > been called. For ForeignScanState, I've assumed that > ForeignScanState.fdw_state being set can be used as a marker that > BeginForeignScan would have been called, though maybe that's not a > solid assumption. 
> > I'm also attaching a new (small) patch 0003 that eliminates the > loop-over-rangetable in ExecCloseRangeTableRelations() in favor of > iterating over a new List field of EState named es_opened_relations, > which is populated by ExecGetRangeTableRelation() with only the > relations that were opened. This speeds up > ExecCloseRangeTableRelations() significantly for the cases with many > runtime-prunable partitions. Here's another version with some cosmetic changes, like fixing some factually incorrect / obsolete comments and typos that I found. I also noticed that I had missed noting near some table_open() calls that locks taken with those can't possibly invalidate a plan (such as lazily opened partition routing target partitions) and thus don't need the treatment that locking during execution initialization requires. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Wed, Mar 22, 2023 at 9:48 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Tue, Mar 14, 2023 at 7:07 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Thu, Mar 2, 2023 at 10:52 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > I think I have figured out what might be going wrong on that cfbot > > > animal after building with the same CPPFLAGS as that animal locally. > > > I had forgotten to update _out/_readRangeTblEntry() to account for the > > > patch's change that a view's RTE_SUBQUERY now also preserves relkind > > > in addition to relid and rellockmode for the locking consideration. > > > > > > Also, I noticed that a multi-query Portal execution with rules was > > > failing (thanks to a regression test added in a7d71c41db) because of > > > the snapshot used for the 2nd query onward not being updated for > > > command ID change under patched model of multi-query Portal execution. > > > To wit, under the patched model, all queries in the multi-query Portal > > > case undergo ExecutorStart() before any of it is run with > > > ExecutorRun(). The patch hadn't changed things however to update the > > > snapshot's command ID for the 2nd query onwards, which caused the > > > aforementioned test case to fail. > > > > > > This new model does however mean that the 2nd query onwards must use > > > PushCopiedSnapshot() given the current requirement of > > > UpdateActiveSnapshotCommandId() that the snapshot passed to it must > > > not be referenced anywhere else. The new model basically requires > > > that each query's QueryDesc points to its own copy of the > > > ActiveSnapshot. That may not be a thing in favor of the patched model > > > though. For now, I haven't been able to come up with a better > > > alternative. > > > > Here's a new version addressing the following 2 points. > > > > * Like views, I realized that non-leaf relations of partition trees > > scanned by an Append/MergeAppend would need to be locked separately, > > because ExecInitNode() traversal of the plan tree would not account > > for them. That is, they are not opened using > > ExecGetRangeTableRelation() or ExecOpenScanRelation(). One exception > > is that some (if not all) of those non-leaf relations may be > > referenced in PartitionPruneInfo and so locked as part of initializing > > the corresponding PartitionPruneState, but I decided not to complicate > > the code to filter out such relations from the set locked separately. > > To carry the set of relations to lock, the refactoring patch 0001 > > re-introduces the List of Bitmapset field named allpartrelids into > > Append/MergeAppend nodes, which we had previously removed on the > > grounds that those relations need not be locked separately (commits > > f2343653f5b, f003a7522bf). > > > > * I decided to initialize QueryDesc.planstate even in the cases where > > ExecInitNode() traversal is aborted in the middle on detecting > > CachedPlan invalidation such that it points to a partially initialized > > PlanState tree. My earlier thinking that each PlanState node need not > > be visited for resource cleanup in such cases was naive after all. To > > that end, I've fixed the ExecEndNode() subroutines of all Plan node > > types to account for potentially uninitialized fields. There are a > > couple of cases where I'm a bit doubtful though. In > > ExecEndCustomScan(), there's no indication in CustomScanState whether > > it's OK to call EndCustomScan() when BeginCustomScan() may not have > > been called. 
For ForeignScanState, I've assumed that > > ForeignScanState.fdw_state being set can be used as a marker that > > BeginForeignScan would have been called, though maybe that's not a > > solid assumption. > > > > I'm also attaching a new (small) patch 0003 that eliminates the > > loop-over-rangetable in ExecCloseRangeTableRelations() in favor of > > iterating over a new List field of EState named es_opened_relations, > > which is populated by ExecGetRangeTableRelation() with only the > > relations that were opened. This speeds up > > ExecCloseRangeTableRelations() significantly for the cases with many > > runtime-prunable partitions. > > Here's another version with some cosmetic changes, like fixing some > factually incorrect / obsolete comments and typos that I found. I > also noticed that I had missed noting near some table_open() that > locks taken with those can't possibly invalidate a plan (such as > lazily opened partition routing target partitions) and thus need the > treatment that locking during execution initialization requires. Rebased over 3c05284d83b2 ("Invent GENERIC_PLAN option for EXPLAIN."). -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
> > On Tue, Mar 14, 2023 at 7:07 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > * I decided to initialize QueryDesc.planstate even in the cases where > > > ExecInitNode() traversal is aborted in the middle on detecting > > > CachedPlan invalidation such that it points to a partially initialized > > > PlanState tree. My earlier thinking that each PlanState node need not > > > be visited for resource cleanup in such cases was naive after all. To > > > that end, I've fixed the ExecEndNode() subroutines of all Plan node > > > types to account for potentially uninitialized fields. There are a > > > couple of cases where I'm a bit doubtful though. In > > > ExecEndCustomScan(), there's no indication in CustomScanState whether > > > it's OK to call EndCustomScan() when BeginCustomScan() may not have > > > been called. For ForeignScanState, I've assumed that > > > ForeignScanState.fdw_state being set can be used as a marker that > > > BeginForeignScan would have been called, though maybe that's not a > > > solid assumption. It seems I hadn't noted in the ExecEndNode()'s comment that all node types' recursive subroutines need to handle the change made by this patch that the corresponding ExecInitNode() subroutine may now return early without having initialized all state struct fields. Also noted in the documentation for CustomScan and ForeignScan that the Begin*Scan callback may not have been called at all, so the End*Scan should handle that gracefully. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
Amit Langote <amitlangote09@gmail.com> writes: > [ v38 patchset ] I spent a little bit of time looking through this, and concluded that it's not something I will be wanting to push into v16 at this stage. The patch doesn't seem very close to being committable on its own terms, and even if it was now is not a great time in the dev cycle to be making significant executor API changes. Too much risk of having to thrash the API during beta, or even change it some more in v17. I suggest that we push this forward to the next CF with the hope of landing it early in v17. A few concrete thoughts: * I understand that your plan now is to acquire locks on all the originally-named tables, then do permissions checks (which will involve only those tables), then dynamically lock just inheritance and partitioning child tables as we descend the plan tree. That seems more or less okay to me, but it could be reflected better in the structure of the patch perhaps. * In particular I don't much like the "viewRelations" list, which seems like a wart; those ought to be handled more nearly the same way as other RTEs. (One concrete reason why is that this scheme is going to result in locking views in a different order than they were locked during original parsing, which perhaps could contribute to deadlocks.) Maybe we should store an integer list of which RTIs need to be locked in the early phase? Building that in the parser/rewriter would provide a solid guide to the original locking order, so we'd be trivially sure of duplicating that. (It might be close enough to follow the RT list order, which is basically what AcquireExecutorLocks does today, but this'd be more certain to do the right thing.) I'm less concerned about lock order for child tables because those are just going to follow the inheritance or partitioning structure. * I don't understand the need for changes like this: /* clean up tuple table */ - ExecClearTuple(node->ps.ps_ResultTupleSlot); + if (node->ps.ps_ResultTupleSlot) + ExecClearTuple(node->ps.ps_ResultTupleSlot); ISTM that the process ought to involve taking a lock (if needed) before we have built any execution state for a given plan node, and if we find we have to fail, returning NULL instead of a partially-valid planstate node. Otherwise, considerations of how to handle partially-valid nodes are going to metastasize into all sorts of places, almost certainly including EXPLAIN for instance. I think we ought to be able to limit the damage to "parent nodes might have NULL child links that you wouldn't have expected". That wouldn't faze ExecEndNode at all, nor most other code. * More attention is needed to comments. For example, in a couple of places in plancache.c you have removed function header comments defining API details and not replaced them with any info about the new details, despite the fact that those details are more complex than the old. > It seems I hadn't noted in the ExecEndNode()'s comment that all node > types' recursive subroutines need to handle the change made by this > patch that the corresponding ExecInitNode() subroutine may now return > early without having initialized all state struct fields. > Also noted in the documentation for CustomScan and ForeignScan that > the Begin*Scan callback may not have been called at all, so the > End*Scan should handle that gracefully. Yeah, I think we need to avoid adding such requirements. It's the sort of thing that would far too easily get past developer testing and only fail once in a blue moon in the field. regards, tom lane
On Tue, Apr 4, 2023 at 6:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <amitlangote09@gmail.com> writes: > > [ v38 patchset ] > > I spent a little bit of time looking through this, and concluded that > it's not something I will be wanting to push into v16 at this stage. > The patch doesn't seem very close to being committable on its own > terms, and even if it was now is not a great time in the dev cycle > to be making significant executor API changes. Too much risk of > having to thrash the API during beta, or even change it some more > in v17. I suggest that we push this forward to the next CF with the > hope of landing it early in v17. OK, thanks a lot for your feedback. > A few concrete thoughts: > > * I understand that your plan now is to acquire locks on all the > originally-named tables, then do permissions checks (which will > involve only those tables), then dynamically lock just inheritance and > partitioning child tables as we descend the plan tree. Actually, with the current implementation of the patch, *all* of the relations mentioned in the plan tree would get locked during the ExecInitNode() traversal of the plan tree (and of those in plannedstmt->subplans), not just the inheritance child tables. Locking of non-child tables done by the executor after this patch is duplicative with AcquirePlannerLocks(), so that's something to be improved. > That seems > more or less okay to me, but it could be reflected better in the > structure of the patch perhaps. > > * In particular I don't much like the "viewRelations" list, which > seems like a wart; those ought to be handled more nearly the same way > as other RTEs. (One concrete reason why is that this scheme is going > to result in locking views in a different order than they were locked > during original parsing, which perhaps could contribute to deadlocks.) > Maybe we should store an integer list of which RTIs need to be locked > in the early phase? Building that in the parser/rewriter would provide > a solid guide to the original locking order, so we'd be trivially sure > of duplicating that. (It might be close enough to follow the RT list > order, which is basically what AcquireExecutorLocks does today, but > this'd be more certain to do the right thing.) I'm less concerned > about lock order for child tables because those are just going to > follow the inheritance or partitioning structure. What you've described here sounds somewhat like what I had implemented in the patch versions till v31, though it used a bitmapset named minLockRelids that is initialized by setrefs.c. Your idea of initializing a list before planning seems more appealing offhand than the code I had added in setrefs.c to populate that minLockRelids bitmapset, which would be bms_add_range(1, list_lenth(finalrtable)), followed by bms_del_members(set-of-child-rel-rtis). I'll give your idea a try. > * I don't understand the need for changes like this: > > /* clean up tuple table */ > - ExecClearTuple(node->ps.ps_ResultTupleSlot); > + if (node->ps.ps_ResultTupleSlot) > + ExecClearTuple(node->ps.ps_ResultTupleSlot); > > ISTM that the process ought to involve taking a lock (if needed) > before we have built any execution state for a given plan node, > and if we find we have to fail, returning NULL instead of a > partially-valid planstate node. Otherwise, considerations of how > to handle partially-valid nodes are going to metastasize into all > sorts of places, almost certainly including EXPLAIN for instance. 
> I think we ought to be able to limit the damage to "parent nodes > might have NULL child links that you wouldn't have expected". > That wouldn't faze ExecEndNode at all, nor most other code. Hmm, yes, taking a lock before allocating any of the stuff to add into the planstate seems like it's much easier to reason about than the alternative I've implemented. > * More attention is needed to comments. For example, in a couple of > places in plancache.c you have removed function header comments > defining API details and not replaced them with any info about the new > details, despite the fact that those details are more complex than the > old. OK, yeah, maybe I've added a bunch of explanations in execMain.c that should perhaps have been in plancache.c. > > It seems I hadn't noted in the ExecEndNode()'s comment that all node > > types' recursive subroutines need to handle the change made by this > > patch that the corresponding ExecInitNode() subroutine may now return > > early without having initialized all state struct fields. > > Also noted in the documentation for CustomScan and ForeignScan that > > the Begin*Scan callback may not have been called at all, so the > > End*Scan should handle that gracefully. > > Yeah, I think we need to avoid adding such requirements. It's the > sort of thing that would far too easily get past developer testing > and only fail once in a blue moon in the field. OK, got it. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
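Spelled out, the setrefs.c-time bookkeeping being compared above amounts to roughly the following; minLockRelids comes from the earlier patch versions and childrel_rtis is a made-up name for the set of child-relation RT indexes, so treat this as an illustration only:

    /* Assume every RTE must be locked by AcquireExecutorLocks() ... */
    glob->minLockRelids = bms_add_range(NULL, 1,
                                        list_length(glob->finalrtable));
    /* ... except the inheritance children added during planning. */
    glob->minLockRelids = bms_del_members(glob->minLockRelids,
                                          childrel_rtis);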
On Tue, Apr 4, 2023 at 10:29 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Apr 4, 2023 at 6:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > A few concrete thoughts:
> >
> > * I understand that your plan now is to acquire locks on all the
> > originally-named tables, then do permissions checks (which will
> > involve only those tables), then dynamically lock just inheritance and
> > partitioning child tables as we descend the plan tree.
>
> Actually, with the current implementation of the patch, *all* of the
> relations mentioned in the plan tree would get locked during the
> ExecInitNode() traversal of the plan tree (and of those in
> plannedstmt->subplans), not just the inheritance child tables.
> Locking of non-child tables done by the executor after this patch is
> duplicative with AcquirePlannerLocks(), so that's something to be
> improved.
>
> > That seems
> > more or less okay to me, but it could be reflected better in the
> > structure of the patch perhaps.
> >
> > * In particular I don't much like the "viewRelations" list, which
> > seems like a wart; those ought to be handled more nearly the same way
> > as other RTEs. (One concrete reason why is that this scheme is going
> > to result in locking views in a different order than they were locked
> > during original parsing, which perhaps could contribute to deadlocks.)
> > Maybe we should store an integer list of which RTIs need to be locked
> > in the early phase? Building that in the parser/rewriter would provide
> > a solid guide to the original locking order, so we'd be trivially sure
> > of duplicating that. (It might be close enough to follow the RT list
> > order, which is basically what AcquireExecutorLocks does today, but
> > this'd be more certain to do the right thing.) I'm less concerned
> > about lock order for child tables because those are just going to
> > follow the inheritance or partitioning structure.
>
> What you've described here sounds somewhat like what I had implemented
> in the patch versions till v31, though it used a bitmapset named
> minLockRelids that is initialized by setrefs.c. Your idea of
> initializing a list before planning seems more appealing offhand than
> the code I had added in setrefs.c to populate that minLockRelids
> bitmapset, which would be bms_add_range(1, list_lenth(finalrtable)),
> followed by bms_del_members(set-of-child-rel-rtis).
>
> I'll give your idea a try.
After sleeping on this, I think we perhaps don't need to remember the originally-named relations just for the purpose of locking them for execution. That's because, for a reused (cached) plan, AcquirePlannerLocks() would have taken those locks anyway.
AcquirePlannerLocks() doesn't lock inheritance children because they would be added to the range table by the planner, so they should be locked separately for execution, if needed. I thought taking the execution-time locks only when inside ExecInit[Merge]Append would work, but then there are cases where single-child Append/MergeAppend plans are stripped of the Append/MergeAppend node by setrefs.c. Maybe we need a place to remember such child relations, that is, only in the cases where Append/MergeAppend elision occurs, perhaps in something esoteric-sounding like PlannedStmt.elidedAppendChildRels?
Another set of child relations that are not covered by Append/MergeAppend child nodes is non-leaf partitions. I've proposed adding a List of Bitmapset field to Append/MergeAppend named 'allpartrelids' as part of this patchset (patch 0001) to track those for execution-time locking.
--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
Here is a new version. Summary of main changes since the last version that Tom reviewed back in April: * ExecInitNode() subroutines now return NULL (as opposed to a partially initialized PlanState node as in the last version) upon detecting that the CachedPlan that the plan tree is from is no longer valid due to invalidation messages processed upon taking locks. Plan tree subnodes that are fully initialized till the point of detection are added by ExecInitNode() into a List in EState called es_inited_plannodes. ExecEndPlan() now iterates over that list to close each one individually using ExecEndNode(). ExecEndNode() or its subroutines thus no longer need to be recursive to close the child nodes. Also, with this design, there is no longer the possibility of partially initialized PlanState trees with partially initialized individual PlanState nodes, so the ExecEndNode() subroutine changes that were in the last version to account for partial initialization are not necessary. * Instead of setting EXEC_FLAG_GET_LOCKS in es_top_eflags for the entire duration of InitPlan(), it is now only set in ExecInitAppend() and ExecInitMergeAppend(), because that's where the subnodes scanning child tables would be and the executor only needs to lock child tables to validate a CachedPlan in a race-free manner. Parent tables that appear in the query would have been locked by AcquirePlannerLocks(). Child tables whose scan subnodes don't appear under Append/MergeAppend (due to the latter being removed by setrefs.c for there being only a single child) are identified in PlannedStmt.elidedAppendChildRelations and InitPlan() locks each one found there if the plan tree is from a CachedPlan. * There's no longer PlannedStmt.viewRelations, because view relations need not be tracked separately for locking as AcquirePlannerLocks() covers them.
Attachment
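To visualize the convention described in the summary above, an ExecInitNode() subroutine under this scheme would contain fragments along these lines; fooState is a placeholder for any node's state struct, and ExecPlanStillValid() / es_inited_plannodes are the patch's additions, so this is abridged illustration rather than actual patch code:

    /* ... right after recursing to initialize a child subplan ... */
    outerPlanState(fooState) = ExecInitNode(outerPlan(node), estate, eflags);
    if (!ExecPlanStillValid(estate))
        return NULL;        /* CachedPlan invalidated by a lock just taken */

    /*
     * ... and at the very end, once the node is fully set up, remember it
     * for ExecEndPlan()'s non-recursive shutdown loop.
     */
    estate->es_inited_plannodes = lappend(estate->es_inited_plannodes,
                                          fooState);
    return (PlanState *) fooState;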
> On 8 Jun 2023, at 16:23, Amit Langote <amitlangote09@gmail.com> wrote: > > Here is a new version. The local planstate variable in the hunk below is shadowing the function parameter planstate which cause a compiler warning: @@ -1495,18 +1556,15 @@ ExecEndPlan(PlanState *planstate, EState *estate) ListCell *l; /* - * shut down the node-type-specific query processing - */ - ExecEndNode(planstate); - - /* - * for subplans too + * Shut down the node-type-specific query processing for all nodes that + * were initialized during InitPlan(), both in the main plan tree and those + * in subplans (es_subplanstates), if any. */ - foreach(l, estate->es_subplanstates) + foreach(l, estate->es_inited_plannodes) { - PlanState *subplanstate = (PlanState *) lfirst(l); + PlanState *planstate = (PlanState *) lfirst(l); -- Daniel Gustafsson
On Mon, Jul 3, 2023 at 10:27 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > On 8 Jun 2023, at 16:23, Amit Langote <amitlangote09@gmail.com> wrote: > > > > Here is a new version. > > The local planstate variable in the hunk below is shadowing the function > parameter planstate which cause a compiler warning: Thanks Daniel for the heads up. Attached new version fixes that and contains a few other notable changes. Before going into the details of those changes, let me reiterate in broad strokes what the patch is trying to do. The idea is to move the locking of some tables referenced in a cached (generic) plan from plancache/GetCachedPlan() to the executor/ExecutorStart(). Specifically, the locking of inheritance child tables. Why? Because partition pruning with "initial pruning steps" contained in the Append/MergeAppend nodes may eliminate some child tables that need not have been locked to begin with, though the pruning can only occur during ExecutorStart(). After applying this patch, GetCachedPlan() only locks the tables that are directly mentioned in the query to ensure that the analyzed-rewritten-but-unplanned query tree backing a given CachedPlan is still valid (cf RevalidateCachedQuery()), but not the tables in the CachedPlan that would have been added by the planner. Tables in a CachePlan that would not be locked currently only include the inheritance child tables / partitions of the tables mentioned in the query. This means that the plan trees in a given CachedPlan returned by GetCachedPlan() are only partially valid and are subject to invalidation because concurrent sessions can possibly modify the child tables referenced in them before ExecutorStart() gets around to locking them. If the concurrent modifications do happen, ExecutorStart() is now equipped to detect them by way of noticing that the CachedPlan is invalidated and inform the caller to discard and recreate the CachedPlan. This entails changing all the call sites of ExecutorStart() that pass it a plan tree from a CachedPlan to implement the replan-and-retry-execution loop. Given the above, ExecutorStart(), which has not needed so far to take any locks (except on indexes mentioned in IndexScans), now needs to lock child tables if executing a cached plan which contains them. In the previous versions, the patch used a flag passed in EState.es_top_eflags to signal ExecGetRangeTableRelation() to lock the table. The flag would be set in ExecInitAppend() and ExecInitMergeAppend() for the duration of the loop that initializes child subplans with the assumption that that's where the child tables would be opened. But not all child subplans of Append/MergeAppend scan child tables (think UNION ALL queries), so this approach can result in redundant locking. Worse, I needed to invent PlannedStmt.elidedAppendChildRelations to separately track child tables whose Scan nodes' parent Append/MergeAppend would be removed by setrefs.c in some cases. So, this new patch uses a flag in the RangeTblEntry itself to denote if the table is a child table instead of the above roundabout way. ExecGetRangeTableRelation() can simply look at the RTE to decide whether to take a lock or not. I considered adding a new bool field, but noticed we already have inFromCl to track if a given RTE is for table/entity directly mentioned in the query or for something added behind-the-scenes into the range table as the field's description in parsenodes.h says. 
RTEs for child tables are added behind-the-scenes by the planner and it makes perfect sense to me to mark their inFromCl as false. I can't find anything that relies on the current behavior of inFromCl being set to the same value as the root inheritance parent (true). Patch 0002 makes this change for child RTEs. A few other notes: * A parallel worker does ExecutorStart() without access to the CachedPlan that the leader may have gotten its plan tree from. This means that parallel workers do not have the ability to detect plan tree invalidations. I think that's fine, because if the leader was able to launch workers at all, it would also have gotten all the locks to protect the (portion of the) plan tree that the workers would be executing. I had an off-list discussion about this with Robert and he mentioned his concern that each parallel worker would have its own view of which child subplans of a parallel Append are "valid", which depends on the result of its own evaluation of initial pruning. So, there may be race conditions whereby a worker may try to execute plan nodes that are no longer valid, for example, if the partition a worker considers valid is not viewed as such by the leader and thus not locked. I shared my thoughts as to why that sounds unlikely at [1], though maybe I'm a bit too optimistic? * For multi-query portals, you can't now do ExecutorStart() immediately followed by ExecutorRun() for each query in the portal, because ExecutorStart() may now fail to start a plan if it gets invalidated. So PortalStart() now does ExecutorStart()s for all queries and remembers the QueryDescs for PortalRun() to then do the ExecutorRun()s with. A consequence of this is that CommandCounterIncrement() now must be done between the ExecutorStart()s of the individual plans in PortalStart() and not between the ExecutorRun()s in PortalRunMulti(). make check-world passes with this new arrangement, though I'm not entirely confident that there are no problems lurking. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com [1] https://postgr.es/m/CA+HiwqFA=swkzgGK8AmXUNFtLeEXFJwFyY3E7cTxvL46aa1OTw@mail.gmail.com
Attachment
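As a rough sketch of the RTE-driven locking decision described above (not the patch's literal code; es_cachedplan is the EState field the patch adds, and the real function also caches the opened relation in es_relations), ExecGetRangeTableRelation() could do something like:

    #include "postgres.h"
    #include "access/table.h"
    #include "executor/executor.h"
    #include "nodes/execnodes.h"

    Relation
    ExecGetRangeTableRelation(EState *estate, Index rti)
    {
        RangeTblEntry *rte = exec_rt_fetch(rti, estate);
        LOCKMODE    lockmode = NoLock;

        Assert(rte->rtekind == RTE_RELATION);

        /*
         * Child-table RTEs (inFromCl == false under the proposed convention)
         * are not locked by AcquirePlannerLocks(), so lock them here when
         * the plan came from the plancache; everything else was locked
         * upstream.
         */
        if (!rte->inFromCl && estate->es_cachedplan != NULL)
            lockmode = rte->rellockmode;

        return table_open(rte->relid, lockmode);
    }

After such a lock is taken, the caller would immediately consult ExecPlanStillValid() as described earlier in the thread.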
On Thu, Jul 6, 2023 at 11:29 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Mon, Jul 3, 2023 at 10:27 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > > On 8 Jun 2023, at 16:23, Amit Langote <amitlangote09@gmail.com> wrote: > > > Here is a new version. > > > > The local planstate variable in the hunk below is shadowing the function > > parameter planstate which cause a compiler warning: > > Thanks Daniel for the heads up. > > Attached new version fixes that and contains a few other notable > changes. Before going into the details of those changes, let me > reiterate in broad strokes what the patch is trying to do. > > The idea is to move the locking of some tables referenced in a cached > (generic) plan from plancache/GetCachedPlan() to the > executor/ExecutorStart(). Specifically, the locking of inheritance > child tables. Why? Because partition pruning with "initial pruning > steps" contained in the Append/MergeAppend nodes may eliminate some > child tables that need not have been locked to begin with, though the > pruning can only occur during ExecutorStart(). > > After applying this patch, GetCachedPlan() only locks the tables that > are directly mentioned in the query to ensure that the > analyzed-rewritten-but-unplanned query tree backing a given CachedPlan > is still valid (cf RevalidateCachedQuery()), but not the tables in the > CachedPlan that would have been added by the planner. Tables in a > CachePlan that would not be locked currently only include the > inheritance child tables / partitions of the tables mentioned in the > query. This means that the plan trees in a given CachedPlan returned > by GetCachedPlan() are only partially valid and are subject to > invalidation because concurrent sessions can possibly modify the child > tables referenced in them before ExecutorStart() gets around to > locking them. If the concurrent modifications do happen, > ExecutorStart() is now equipped to detect them by way of noticing that > the CachedPlan is invalidated and inform the caller to discard and > recreate the CachedPlan. This entails changing all the call sites of > ExecutorStart() that pass it a plan tree from a CachedPlan to > implement the replan-and-retry-execution loop. > > Given the above, ExecutorStart(), which has not needed so far to take > any locks (except on indexes mentioned in IndexScans), now needs to > lock child tables if executing a cached plan which contains them. In > the previous versions, the patch used a flag passed in > EState.es_top_eflags to signal ExecGetRangeTableRelation() to lock the > table. The flag would be set in ExecInitAppend() and > ExecInitMergeAppend() for the duration of the loop that initializes > child subplans with the assumption that that's where the child tables > would be opened. But not all child subplans of Append/MergeAppend > scan child tables (think UNION ALL queries), so this approach can > result in redundant locking. Worse, I needed to invent > PlannedStmt.elidedAppendChildRelations to separately track child > tables whose Scan nodes' parent Append/MergeAppend would be removed by > setrefs.c in some cases. > > So, this new patch uses a flag in the RangeTblEntry itself to denote > if the table is a child table instead of the above roundabout way. > ExecGetRangeTableRelation() can simply look at the RTE to decide > whether to take a lock or not. 
I considered adding a new bool field, > but noticed we already have inFromCl to track if a given RTE is for > table/entity directly mentioned in the query or for something added > behind-the-scenes into the range table as the field's description in > parsenodes.h says. RTEs for child tables are added behind-the-scenes > by the planner and it makes perfect sense to me to mark their inFromCl > as false. I can't find anything that relies on the current behavior > of inFromCl being set to the same value as the root inheritance parent > (true). Patch 0002 makes this change for child RTEs. > > A few other notes: > > * A parallel worker does ExecutorStart() without access to the > CachedPlan that the leader may have gotten its plan tree from. This > means that parallel workers do not have the ability to detect plan > tree invalidations. I think that's fine, because if the leader would > have been able to launch workers at all, it would also have gotten all > the locks to protect the (portion of) the plan tree that the workers > would be executing. I had an off-list discussion about this with > Robert and he mentioned his concern that each parallel worker would > have its own view of which child subplans of a parallel Append are > "valid" that depends on the result of its own evaluation of initial > pruning. So, there may be race conditions whereby a worker may try > to execute plan nodes that are no longer valid, for example, if the > partition a worker considers valid is not viewed as such by the leader > and thus not locked. I shared my thoughts as to why that sounds > unlikely at [1], though maybe I'm a bit too optimistic? > > * For multi-query portals, you can't now do ExecutorStart() > immediately followed by ExecutorRun() for each query in the portal, > because ExecutorStart() may now fail to start a plan if it gets > invalidated. So PortalStart() now does ExecutorStart()s for all > queries and remembers the QueryDescs for PortalRun() then to do > ExecutorRun()s using. A consequence of this is that > CommandCounterIncrement() now must be done between the > ExecutorStart()s of the individual plans in PortalStart() and not > between the ExecutorRun()s in PortalRunMulti(). make check-world > passes with this new arrangement, though I'm not entirely confident > that there are no problems lurking. In an absolutely brown-paper-bag moment, I realized that I had not updated src/backend/executor/README to reflect the changes to the executor's control flow that this patch makes. That is, after scrapping the old design back in January whose details *were* reflected in the patches before that redesign. Anyway, the attached fixes that. Tom, do you think you have bandwidth in the near future to give this another look? I think I've addressed the comments that you had given back in April, though as mentioned in the previous message, there may still be some funny-looking aspects still remaining. In any case, I have no intention of pressing ahead with the patch without another committer having had a chance to sign off on it. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Thu, 13 Jul 2023 at 13:59, Amit Langote <amitlangote09@gmail.com> wrote: > In an absolutely brown-paper-bag moment, I realized that I had not > updated src/backend/executor/README to reflect the changes to the > executor's control flow that this patch makes. That is, after > scrapping the old design back in January whose details *were* > reflected in the patches before that redesign. > > Anyway, the attached fixes that. > > Tom, do you think you have bandwidth in the near future to give this > another look? I think I've addressed the comments that you had given > back in April, though as mentioned in the previous message, there may > still be some funny-looking aspects still remaining. In any case, I > have no intention of pressing ahead with the patch without another > committer having had a chance to sign off on it.

I've only just started taking a look at this, and my first test drive yields very impressive results:

8192 partitions (3 runs, 10000 rows)
Head     391.294989   382.622481   379.252236
Patched  13088.145995 13406.135531 13431.828051

Looking at your changes to README, I would like to suggest rewording the following:

+table during planning. This means that inheritance child tables, which are
+added to the query's range table during planning, if they are present in a
+cached plan tree would not have been locked.

To:

This means that inheritance child tables present in a cached plan tree, which are added to the query's range table during planning, would not have been locked.

Also, further down:

s/intiatialize/initialize/

I'll carry on taking a closer look and see if I can break it.

Thom
Hi Thom, On Tue, Jul 18, 2023 at 1:33 AM Thom Brown <thom@linux.com> wrote: > On Thu, 13 Jul 2023 at 13:59, Amit Langote <amitlangote09@gmail.com> wrote: > > In an absolutely brown-paper-bag moment, I realized that I had not > > updated src/backend/executor/README to reflect the changes to the > > executor's control flow that this patch makes. That is, after > > scrapping the old design back in January whose details *were* > > reflected in the patches before that redesign. > > > > Anyway, the attached fixes that. > > > > Tom, do you think you have bandwidth in the near future to give this > > another look? I think I've addressed the comments that you had given > > back in April, though as mentioned in the previous message, there may > > still be some funny-looking aspects still remaining. In any case, I > > have no intention of pressing ahead with the patch without another > > committer having had a chance to sign off on it. > > I've only just started taking a look at this, and my first test drive > yields very impressive results: > > 8192 partitions (3 runs, 10000 rows) > Head 391.294989 382.622481 379.252236 > Patched 13088.145995 13406.135531 13431.828051 Just to be sure, did you use pgbench --Mprepared with plan_cache_mode = force_generic_plan in postgresql.conf? > Looking at your changes to README, I would like to suggest rewording > the following: > > +table during planning. This means that inheritance child tables, which are > +added to the query's range table during planning, if they are present in a > +cached plan tree would not have been locked. > > To: > > This means that inheritance child tables present in a cached plan > tree, which are added to the query's range table during planning, > would not have been locked. > > Also, further down: > > s/intiatialize/initialize/ > > I'll carry on taking a closer look and see if I can break it. Thanks for looking. I've fixed these issues in the attached updated patch. I've also changed the position of a newly added paragraph in src/backend/executor/README so that it doesn't break the flow of the existing text. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Tue, 18 Jul 2023, 08:26 Amit Langote, <amitlangote09@gmail.com> wrote:
Hi Thom,
On Tue, Jul 18, 2023 at 1:33 AM Thom Brown <thom@linux.com> wrote:
> On Thu, 13 Jul 2023 at 13:59, Amit Langote <amitlangote09@gmail.com> wrote:
> > In an absolutely brown-paper-bag moment, I realized that I had not
> > updated src/backend/executor/README to reflect the changes to the
> > executor's control flow that this patch makes. That is, after
> > scrapping the old design back in January whose details *were*
> > reflected in the patches before that redesign.
> >
> > Anyway, the attached fixes that.
> >
> > Tom, do you think you have bandwidth in the near future to give this
> > another look? I think I've addressed the comments that you had given
> > back in April, though as mentioned in the previous message, there may
> > still be some funny-looking aspects still remaining. In any case, I
> > have no intention of pressing ahead with the patch without another
> > committer having had a chance to sign off on it.
>
> I've only just started taking a look at this, and my first test drive
> yields very impressive results:
>
> 8192 partitions (3 runs, 10000 rows)
> Head 391.294989 382.622481 379.252236
> Patched 13088.145995 13406.135531 13431.828051
Just to be sure, did you use pgbench --Mprepared with plan_cache_mode
= force_generic_plan in postgresql.conf?
I did.
For full disclosure, I also had max_locks_per_transaction set to 10000.
> Looking at your changes to README, I would like to suggest rewording
> the following:
>
> +table during planning. This means that inheritance child tables, which are
> +added to the query's range table during planning, if they are present in a
> +cached plan tree would not have been locked.
>
> To:
>
> This means that inheritance child tables present in a cached plan
> tree, which are added to the query's range table during planning,
> would not have been locked.
>
> Also, further down:
>
> s/intiatialize/initialize/
>
> I'll carry on taking a closer look and see if I can break it.
Thanks for looking. I've fixed these issues in the attached updated
patch. I've also changed the position of a newly added paragraph in
src/backend/executor/README so that it doesn't break the flow of the
existing text.
Thanks.
Thom
While chatting with Robert about this patch set, he suggested that it would be better to break out some executor refactoring changes from the main patch (0003) into a separate patch. To wit, the changes to make the PlanState tree cleanup in ExecEndPlan() non-recursive by walking a flat list of PlanState nodes instead of the recursive tree walk that ExecEndNode() currently does. That allows us to cleanly handle the cases where the PlanState tree is only partially constructed when ExecInitNode() detects in the middle of its construction that the plan tree is no longer valid after receiving and processing an invalidation message on locking child tables. Or at least more cleanly than the previously proposed approach of adjusting ExecEndNode() subroutines for the individual node types to gracefully handle such partially initialized PlanState trees. With the new approach, node type specific subroutines of ExecEndNode() need not close their child nodes, because ExecEndPlan() would directly close each node that has been initialized. I couldn't find any instance of breakage caused by this decoupling of child node cleanup from their parent node's cleanup. Comments in ExecEndGather() and ExecEndGatherMerge() appear to suggest that outerPlan must be closed before the local cleanup:

 void
 ExecEndGather(GatherState *node)
 {
-    ExecEndNode(outerPlanState(node));  /* let children clean up first */
+    /* outerPlan is closed separately. */
     ExecShutdownGather(node);
     ExecFreeExprContext(&node->ps);

But I don't think there's a problem, because what ExecShutdownGather() does seems entirely independent of cleanup of outerPlan.

As for the performance impact of initializing the list of initialized nodes to use during the cleanup phase, I couldn't find a regression, nor any improvement from replacing the tree walk with a linear scan of a list. Actually, ExecEndNode() is pretty far down in the perf profile anyway, so the performance difference caused by the patch hardly matters. See the following contrived example:

create table f();
analyze f;

explain (costs off) select count(*) from f f1, f f2, f f3, f f4, f f5, f f6, f f7, f f8, f f9, f f10;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate
   ->  Nested Loop
         ->  Nested Loop
               ->  Nested Loop
                     ->  Nested Loop
                           ->  Nested Loop
                                 ->  Nested Loop
                                       ->  Nested Loop
                                             ->  Nested Loop
                                                   ->  Nested Loop
                                                         ->  Seq Scan on f f1
                                                         ->  Seq Scan on f f2
                                                   ->  Seq Scan on f f3
                                             ->  Seq Scan on f f4
                                       ->  Seq Scan on f f5
                                 ->  Seq Scan on f f6
                           ->  Seq Scan on f f7
                     ->  Seq Scan on f f8
               ->  Seq Scan on f f9
         ->  Seq Scan on f f10
(20 rows)

do $$
begin
  for i in 1..100000 loop
    perform count(*) from f f1, f f2, f f3, f f4, f f5, f f6, f f7, f f8, f f9, f f10;
  end loop;
end; $$;

Times for the DO:

Unpatched:
Time: 756.353 ms
Time: 745.752 ms
Time: 749.184 ms

Patched:
Time: 737.717 ms
Time: 747.815 ms
Time: 753.456 ms

I've attached the new refactoring patch as 0001.

Another change I've made in the main patch is to change the API of ExecutorStart() (and ExecutorStart_hook) to more explicitly return a boolean indicating whether or not the plan initialization was successful. That way seems better than making the callers figure that out by seeing that QueryDesc.planstate is NULL and/or checking QueryDesc.plan_valid. Correspondingly, PortalStart() now also returns true or false matching what ExecutorStart() returned. I suppose this better alerts any extensions that use the ExecutorStart_hook to fix their code to do the right thing. 
Having extracted the ExecEndNode() change, I'm also starting to feel inclined to extract a couple of other bits from the main patch as separate patches, such as moving the ExecutorStart() call from PortalRun() to PortalStart() for the multi-query portals. I'll do that in the next version.
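For readers following along, the flat-list arrangement described above boils down to fragments along these lines; the field and variable names are illustrative rather than taken from the patch:

/* In ExecInitNode(), once "result" has been fully initialized: */
estate->es_planstate_nodes = lappend(estate->es_planstate_nodes, result);

/* In ExecEndPlan(), instead of one recursive ExecEndNode() call on the root: */
ListCell   *lc;

foreach(lc, estate->es_planstate_nodes)
{
    PlanState  *ps = (PlanState *) lfirst(lc);

    ExecEndNode(ps);    /* per-node routines no longer recurse to children */
}

Assuming each node is appended only after its children have been initialized, walking the list in order also ends children before their parents, which is the ordering the Gather nodes rely on.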
Attachment
- v43-0002-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v43-0001-Make-PlanState-tree-cleanup-non-recursive.patch
- v43-0005-Track-opened-range-table-relations-in-a-List-in-.patch
- v43-0003-Set-inFromCl-to-false-in-child-table-RTEs.patch
- v43-0004-Delay-locking-of-child-tables-in-cached-plans-un.patch
On Wed, Aug 2, 2023 at 10:39 PM Amit Langote <amitlangote09@gmail.com> wrote: > Having extracted the ExecEndNode() change, I'm also starting to feel > inclined to extract a couple of other bits from the main patch as > separate patches, such as moving the ExecutorStart() call from > PortalRun() to PortalStart() for the multi-query portals. I'll do > that in the next version. Here's a patch set where the refactoring to move the ExecutorStart() calls to be closer to GetCachedPlan() (for the call sites that use a CachedPlan) is extracted into a separate patch, 0002. Its commit message notes an aspect of this refactoring that I feel a bit nervous about -- needing to also move the CommandCounterIncrement() call from the loop in PortalRunMulti() to PortalStart() which now does ExecutorStart() for the PORTAL_MULTI_QUERY case. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
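To make the refactoring concrete, the replan-and-retry loop at such a call site might look roughly like the following sketch. The surrounding variables (plansource, params, query_string, dest, queryEnv) are assumed context, the bool-returning ExecutorStart() is the patched API rather than the current one, and the cleanup of a failed start is glossed over:

QueryDesc  *qdesc;

for (;;)
{
    CachedPlan *cplan = GetCachedPlan(plansource, params, NULL, queryEnv);
    PlannedStmt *pstmt = linitial_node(PlannedStmt, cplan->stmt_list);

    qdesc = CreateQueryDesc(pstmt, query_string,
                            GetActiveSnapshot(), InvalidSnapshot,
                            dest, params, queryEnv, 0);

    /* Patched API: returns false if locking partitions invalidated the plan. */
    if (ExecutorStart(qdesc, 0))
        break;                  /* plan is fully locked and still valid */

    /* Discard the stale plan and build a fresh one on the next iteration. */
    FreeQueryDesc(qdesc);
    ReleaseCachedPlan(cplan, NULL);
}

/* ... then ExecutorRun(), ExecutorFinish(), ExecutorEnd() as usual ... */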
Attachment
- v44-0004-Set-inFromCl-to-false-in-child-table-RTEs.patch
- v44-0006-Track-opened-range-table-relations-in-a-List-in-.patch
- v44-0002-Refactoring-to-move-ExecutorStart-calls-to-be-ne.patch
- v44-0003-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v44-0005-Delay-locking-of-child-tables-in-cached-plans-un.patch
- v44-0001-Make-PlanState-tree-cleanup-non-recursive.patch
On Thu, Aug 3, 2023 at 4:37 AM Amit Langote <amitlangote09@gmail.com> wrote: > Here's a patch set where the refactoring to move the ExecutorStart() > calls to be closer to GetCachedPlan() (for the call sites that use a > CachedPlan) is extracted into a separate patch, 0002. Its commit > message notes an aspect of this refactoring that I feel a bit nervous > about -- needing to also move the CommandCounterIncrement() call from > the loop in PortalRunMulti() to PortalStart() which now does > ExecutorStart() for the PORTAL_MULTI_QUERY case. I spent some time today reviewing 0001. Here are a few thoughts and notes about things that I looked at. First, I wondered whether it was really adequate for ExecEndPlan() to just loop over estate->es_plan_nodes and call it good. Put differently, is it possible that we could ever have more than one relevant EState, say for a subplan or an EPQ execution or something, so that this loop wouldn't cover everything? I found nothing to make me think that this is a real danger. Second, I wondered whether the ordering of cleanup operations could be an issue. Right now, a node can position cleanup code before, after, or both before and after recursing to child nodes, whereas with this design change, the cleanup code will always be run before recursing to child nodes. Here, I think we have problems. Both ExecGather and ExecEndGatherMerge intentionally clean up the children before the parent, so that the child shutdown happens before ExecParallelCleanup(). Based on the comment and commit acf555bc53acb589b5a2827e65d655fa8c9adee0, this appears to be intentional, and you can sort of see why from looking at the stuff that happens in ExecParallelCleanup(). If the instrumentation data vanishes before the child nodes have a chance to clean things up, maybe EXPLAIN ANALYZE won't reflect that instrumentation any more. If the DSA vanishes, maybe we'll crash if we try to access it. If we actually reach DestroyParallelContext(), we're just going to start killing the workers. None of that sounds like what we want. The good news, of a sort, is that I think this might be the only case of this sort of problem. Most nodes recurse at the end, after doing all the cleanup, so the behavior won't change. Moreover, even if it did, most cleanup operations look pretty localized -- they affect only the node itself, and not its children. A somewhat interesting case is nodes associated with subplans. Right now, because of the coding of ExecEndPlan, nodes associated with subplans are all cleaned up at the very end, after everything that's not inside of a subplan. But with this change, they'd get cleaned up in the order of initialization, which actually seems more natural, as long as it doesn't break anything, which I think it probably won't, since as I mention in most cases node cleanup looks quite localized, i.e. it doesn't care whether it happens before or after the cleanup of other nodes. I think something will have to be done about the parallel query stuff, though. I'm not sure exactly what. It is a little weird that Gather and Gather Merge treat starting and killing workers as a purely "private matter" that they can decide to handle without the executor overall being very much aware of it. So maybe there's a way that some of the cleanup logic here could be hoisted up into the general executor machinery, that is, first end all the nodes, and then go back, and end all the parallelism using, maybe, another list inside of the estate. 
However, I think that the existence of ExecShutdownNode() is a complication here -- we need to make sure that we don't break either the case where that happens before overall plan shutdown, or the case where it doesn't. Third, a couple of minor comments on details of how you actually made these changes in the patch set. Personally, I would remove all of the "is closed separately" comments that you added. I think it's a violation of the general coding principle that you should make the code look like it's always been that way. Sure, in the immediate future, people might wonder why you don't need to recurse, but 5 or 10 years from now that's just going to be clutter. Second, in the cases where the ExecEndNode functions end up completely empty, I would suggest just removing the functions entirely and making the switch that dispatches on the node type have a switch case that lists all the nodes that don't need a callback here and say /* Nothing to do for these node types */ break;. This will save a few CPU cycles and I think it will be easier to read as well. Fourth, I wonder whether we really need this patch at all. I initially thought we did, because if we abandon the initialization of a plan partway through, then we end up with a plan that is in a state that previously would never have occurred, and we still have to be able to clean it up. However, perhaps it's a difference without a distinction. Say we have a partial plan tree, where not all of the PlanState nodes ever got created. We then just call the existing version of ExecEndPlan() on it, with no changes. What goes wrong? Sure, we might call ExecEndNode() on some null pointers where in the current world there would always be valid pointers, but ExecEndNode() will handle that just fine, by doing nothing for those nodes, because it starts with a NULL-check. Another alternative design might be to switch ExecEndNode to use planstate_tree_walker to walk the node tree, removing the walk from the node-type-specific functions as in this patch, and deleting the end-node functions that are no longer required altogether, as proposed above. I somehow feel that this would be cleaner than the status quo, but here again, I'm not sure we really need it. planstate_tree_walker would just pass over any NULL pointers that it found without doing anything, but the current code does that too, so while this might be more beautiful than what we have now, I'm not sure that there's any real reason to do it. The fact that, like the current patch, it would change the order in which nodes are cleaned up is also an issue -- the Gather/Gather Merge ordering issues might be easier to handle this way with some hack in ExecEndNode() than they are with the design you have now, but we'd still have to do something about them, I believe. Sorry if this is a bit of a meandering review, but those are my thoughts. -- Robert Haas EDB: http://www.enterprisedb.com
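For reference, the NULL-check being relied on here sits at the very top of ExecEndNode(); roughly, as an excerpt with the per-node dispatch elided:

void
ExecEndNode(PlanState *node)
{
    /*
     * do nothing when we get to the end of a leaf on tree.
     */
    if (node == NULL)
        return;

    /* ... dispatch to the node-type-specific ExecEnd* routine via nodeTag(node) ... */
}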
Robert Haas <robertmhaas@gmail.com> writes: > Second, I wondered whether the ordering of cleanup operations could be > an issue. Right now, a node can position cleanup code before, after, > or both before and after recursing to child nodes, whereas with this > design change, the cleanup code will always be run before recursing to > child nodes. Here, I think we have problems. Both ExecGather and > ExecEndGatherMerge intentionally clean up the children before the > parent, so that the child shutdown happens before > ExecParallelCleanup(). Based on the comment and commit > acf555bc53acb589b5a2827e65d655fa8c9adee0, this appears to be > intentional, and you can sort of see why from looking at the stuff > that happens in ExecParallelCleanup(). Right, I doubt that changing that is going to work out well. Hash joins might have issues with it too. Could it work to make the patch force child cleanup before parent, instead of after? Or would that break other places? On the whole though I think it's probably a good idea to leave parent nodes in control of the timing, so I kind of side with your later comment about whether we want to change this at all. regards, tom lane
On Mon, Aug 7, 2023 at 11:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Right, I doubt that changing that is going to work out well. > Hash joins might have issues with it too. I thought about the case, because Hash and Hash Join are such closely intertwined nodes, but I don't see any problem there. It doesn't really look like it would matter in what order things got cleaned up. Unless I'm missing something, all of the data structures are just independent things that we have to get rid of sometime. > Could it work to make the patch force child cleanup before parent, > instead of after? Or would that break other places? To me, it seems like the overwhelming majority of the code simply doesn't care. You could pick an order out of a hat and it would be 100% OK. But I haven't gone and looked through it with this specific idea in mind. > On the whole though I think it's probably a good idea to leave > parent nodes in control of the timing, so I kind of side with > your later comment about whether we want to change this at all. My overall feeling here is that what Gather and Gather Merge is doing is pretty weird. I think I kind of knew that at the time this was all getting implemented and reviewed, but I wasn't keen to introduce more infrastructure changes than necessary given that parallel query, as a project, was still pretty new and I didn't want to give other hackers more reasons to be unhappy with what was already a lot of very wide-ranging change to the system. A good number of years having gone by now, and other people having worked on that code some more, I'm not too worried about someone calling for a wholesale revert of parallel query. However, there's a second problem here as well, which is that I'm still not sure what the right thing to do is. We've fiddled around with the shutdown sequence for parallel query a number of times now, and I think there's still stuff that doesn't work quite right, especially around getting all of the instrumentation data back to the leader. I haven't spent enough time on this recently enough to be sure what if any problems remain, though. So on the one hand, I don't really like the fact that we have an ad-hoc recursion arrangement here, instead of using planstate_tree_walker or, as Amit proposes, a List. Giving subordinate nodes control over the ordering when they don't really need it just means we have more code with more possibility for bugs and less certainty about whether the theoretical flexibility is doing anything in practice. But on the other hand, because we know that at least for the Gather/GatherMerge case it seems like it probably matters somewhat, it definitely seems appealing not to change anything as part of this patch set that we don't really have to. I've had it firmly in my mind here that we were going to need to change something somehow -- I mean, the possibility of returning in the middle of node initialization seems like a pretty major change to the way this stuff works, and it seems hard for me to believe that we can just do that and not have to adjust any code anywhere else. Can it really be true that we can do that and yet not end up creating any states anywhere with which the current cleanup code is unprepared to cope? Maybe, but it would seem like rather good luck if that's how it shakes out. Still, at the moment, I'm having a hard time understanding what this particular change buys us. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Aug 8, 2023 at 12:36 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Aug 3, 2023 at 4:37 AM Amit Langote <amitlangote09@gmail.com> wrote: > > Here's a patch set where the refactoring to move the ExecutorStart() > > calls to be closer to GetCachedPlan() (for the call sites that use a > > CachedPlan) is extracted into a separate patch, 0002. Its commit > > message notes an aspect of this refactoring that I feel a bit nervous > > about -- needing to also move the CommandCounterIncrement() call from > > the loop in PortalRunMulti() to PortalStart() which now does > > ExecutorStart() for the PORTAL_MULTI_QUERY case. > > I spent some time today reviewing 0001. Here are a few thoughts and > notes about things that I looked at. Thanks for taking a look at this. > First, I wondered whether it was really adequate for ExecEndPlan() to > just loop over estate->es_plan_nodes and call it good. Put > differently, is it possible that we could ever have more than one > relevant EState, say for a subplan or an EPQ execution or something, > so that this loop wouldn't cover everything? I found nothing to make > me think that this is a real danger. Check. > Second, I wondered whether the ordering of cleanup operations could be > an issue. Right now, a node can position cleanup code before, after, > or both before and after recursing to child nodes, whereas with this > design change, the cleanup code will always be run before recursing to > child nodes. Because a node is appended to es_planstate_nodes at the end of ExecInitNode(), child nodes get added before their parent nodes. So the children are cleaned up first. > Here, I think we have problems. Both ExecGather and > ExecEndGatherMerge intentionally clean up the children before the > parent, so that the child shutdown happens before > ExecParallelCleanup(). Based on the comment and commit > acf555bc53acb589b5a2827e65d655fa8c9adee0, this appears to be > intentional, and you can sort of see why from looking at the stuff > that happens in ExecParallelCleanup(). If the instrumentation data > vanishes before the child nodes have a chance to clean things up, > maybe EXPLAIN ANALYZE won't reflect that instrumentation any more. If > the DSA vanishes, maybe we'll crash if we try to access it. If we > actually reach DestroyParallelContext(), we're just going to start > killing the workers. None of that sounds like what we want. > > The good news, of a sort, is that I think this might be the only case > of this sort of problem. Most nodes recurse at the end, after doing > all the cleanup, so the behavior won't change. Moreover, even if it > did, most cleanup operations look pretty localized -- they affect only > the node itself, and not its children. A somewhat interesting case is > nodes associated with subplans. Right now, because of the coding of > ExecEndPlan, nodes associated with subplans are all cleaned up at the > very end, after everything that's not inside of a subplan. But with > this change, they'd get cleaned up in the order of initialization, > which actually seems more natural, as long as it doesn't break > anything, which I think it probably won't, since as I mention in most > cases node cleanup looks quite localized, i.e. it doesn't care whether > it happens before or after the cleanup of other nodes. > > I think something will have to be done about the parallel query stuff, > though. I'm not sure exactly what. 
It is a little weird that Gather > and Gather Merge treat starting and killing workers as a purely > "private matter" that they can decide to handle without the executor > overall being very much aware of it. So maybe there's a way that some > of the cleanup logic here could be hoisted up into the general > executor machinery, that is, first end all the nodes, and then go > back, and end all the parallelism using, maybe, another list inside of > the estate. However, I think that the existence of ExecShutdownNode() > is a complication here -- we need to make sure that we don't break > either the case where that happen before overall plan shutdown, or the > case where it doesn't. Given that children are closed before parent, the order of operations in ExecEndGather[Merge] is unchanged. > Third, a couple of minor comments on details of how you actually made > these changes in the patch set. Personally, I would remove all of the > "is closed separately" comments that you added. I think it's a > violation of the general coding principle that you should make the > code look like it's always been that way. Sure, in the immediate > future, people might wonder why you don't need to recurse, but 5 or 10 > years from now that's just going to be clutter. Second, in the cases > where the ExecEndNode functions end up completely empty, I would > suggest just removing the functions entirely and making the switch > that dispatches on the node type have a switch case that lists all the > nodes that don't need a callback here and say /* Nothing do for these > node types */ break;. This will save a few CPU cycles and I think it > will be easier to read as well. I agree with both suggestions. > Fourth, I wonder whether we really need this patch at all. I initially > thought we did, because if we abandon the initialization of a plan > partway through, then we end up with a plan that is in a state that > previously would never have occurred, and we still have to be able to > clean it up. However, perhaps it's a difference without a distinction. > Say we have a partial plan tree, where not all of the PlanState nodes > ever got created. We then just call the existing version of > ExecEndPlan() on it, with no changes. What goes wrong? Sure, we might > call ExecEndNode() on some null pointers where in the current world > there would always be valid pointers, but ExecEndNode() will handle > that just fine, by doing nothing for those nodes, because it starts > with a NULL-check. Well, not all cleanup actions for a given node type are a recursive call to ExecEndNode(), some are also things like this: /* * clean out the tuple table */ ExecClearTuple(node->ps.ps_ResultTupleSlot); But should ExecInitNode() subroutines return the partially initialized PlanState node or NULL on detecting invalidation? If I'm understanding how you think this should be working correctly, I think you mean the former, because if it were the latter, ExecInitNode() would end up returning NULL at the top for the root and then there's nothing to pass to ExecEndNode(), so no way to clean up to begin with. In that case, I think we will need to adjust ExecEndNode() subroutines to add `if (node->ps.ps_ResultTupleSlot)` in the above code, for example. That's something Tom had said he doesn't like very much [1]. Some node types such as Append, BitmapAnd, etc. that contain a list of subplans would need some adjustment, such as using palloc0 for as_appendplans[], etc. so that uninitialized subplans have NULL in the array. 
There are also issues around ForeignScan, CustomScan ExecEndNode()-time callbacks when they are partially initialized -- is it OK to call the *EndScan callback if the *BeginScan one may not have been called to begin with? Though, perhaps we can adjust the ExecInitNode() subroutines for those to return NULL by opening the relation and checking for invalidation at the beginning instead of in the middle. That should be done for all Scan or leaf-level node types. Anyway, I guess, for the patch's purpose, maybe we should bite the bullet and make those adjustments rather than change ExecEndNode() as proposed. I can give that another try. > Another alternative design might be to switch ExecEndNode to use > planstate_tree_walker to walk the node tree, removing the walk from > the node-type-specific functions as in this patch, and deleting the > end-node functions that are no longer required altogether, as proposed > above. I somehow feel that this would be cleaner than the status quo, > but here again, I'm not sure we really need it. planstate_tree_walker > would just pass over any NULL pointers that it found without doing > anything, but the current code does that too, so while this might be > more beautiful than what we have now, I'm not sure that there's any > real reason to do it. The fact that, like the current patch, it would > change the order in which nodes are cleaned up is also an issue -- the > Gather/Gather Merge ordering issues might be easier to handle this way > with some hack in ExecEndNode() than they are with the design you have > now, but we'd still have to do something about them, I believe. It might be interesting to see if introducing planstate_tree_walker() in ExecEndNode() makes it easier to reason about ExecEndNode() generally speaking, but I think you may be right that doing so may not really make matters easier for the partially initialized planstate tree case. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
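To make that concrete, an ExecEnd* routine adjusted along those lines might look like the following sketch; FooState and its tableContext field are placeholders, not a real node type:

void
ExecEndFoo(FooState *node)
{
    /* these may never have been set up if ExecInitFoo() bailed out early */
    if (node->ps.ps_ResultTupleSlot != NULL)
        ExecClearTuple(node->ps.ps_ResultTupleSlot);
    if (node->tableContext != NULL)
        MemoryContextDelete(node->tableContext);

    /* ExecEndNode() already tolerates NULL, so the child needs no guard */
    ExecEndNode(outerPlanState(node));
}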
On Tue, Aug 8, 2023 at 10:32 AM Amit Langote <amitlangote09@gmail.com> wrote: > But should ExecInitNode() subroutines return the partially initialized > PlanState node or NULL on detecting invalidation? If I'm > understanding how you think this should be working correctly, I think > you mean the former, because if it were the latter, ExecInitNode() > would end up returning NULL at the top for the root and then there's > nothing to pass to ExecEndNode(), so no way to clean up to begin with. > In that case, I think we will need to adjust ExecEndNode() subroutines > to add `if (node->ps.ps_ResultTupleSlot)` in the above code, for > example. That's something Tom had said he doesn't like very much [1]. Yeah, I understood Tom's goal as being "don't return partially initialized nodes." Personally, I'm not sure that's an important goal. In fact, I don't even think it's a desirable one. It doesn't look difficult to audit the end-node functions for cases where they'd fail if a particular pointer were NULL instead of pointing to some real data, and just fixing all such cases to have NULL-tests looks like purely mechanical work that we are unlikely to get wrong. And at least some cases wouldn't require any changes at all. If we don't do that, the complexity doesn't go away. It just moves someplace else. Presumably what we do in that case is have ExecInitNode functions undo any initialization that they've already done before returning NULL. There are basically two ways to do that. Option one is to add code at the point where they return early to clean up anything they've already initialized, but that code is likely to substantially duplicate whatever the ExecEndNode function already knows how to do, and it's very easy for logic like this to get broken if somebody rearranges an ExecInitNode function down the road. Option two is to rearrange the ExecInitNode functions now, to open relations or recurse at the beginning, so that we discover the need to fail before we initialize anything. That restricts our ability to further rearrange the functions in future somewhat, but more importantly, IMHO, it introduces more risk right now. Checking that the ExecEndNode function will not fail if some pointers are randomly null is a lot easier than checking that changing the order of operations in an ExecInitNode function breaks nothing. I'm not here to say that we can't do one of those things. But I think adding null-tests to ExecEndNode functions looks like *far* less work and *way* less risk. There's a second issue here, too, which is when we abort ExecInitNode partway through, how do we signal that? You're rightly pointing out here that if we do that by returning NULL, then we don't do it by returning a pointer to the partially initialized node that we just created, which means that we either need to store those partially initialized nodes in a separate data structure as you propose to do in 0001, or else we need to pick a different signalling convention. We could change (a) ExecInitNode to have an additional argument, bool *kaboom, or (b) we could make it return bool and return the node pointer via a new additional argument, or (c) we could put a Boolean flag into the estate and let the function signal failure by flipping the value of the flag. If we do any of those things, then as far as I can see 0001 is unnecessary. If we do none of them but also avoid creating partially initialized nodes by one of the two techniques mentioned two paragraphs prior, then 0001 is also unnecessary. 
If we do none of them but do create partially initialized nodes, then we need 0001. So if this were a restaurant menu, then it might look like this:

Prix Fixe Menu (choose one from each)

First Course - How do we clean up after partial initialization?
(1) ExecInitNode functions produce partially initialized nodes
(2) ExecInitNode functions get refactored so that the stuff that can cause early exit always happens first, so that no cleanup is ever needed
(3) ExecInitNode functions do any required cleanup in situ

Second Course - How do we signal that initialization stopped early?
(A) Return NULL.
(B) Add a bool * out-parameter to ExecInitNode.
(C) Add a Node * out-parameter to ExecInitNode and change the return value to bool.
(D) Add a bool to the EState.
(E) Something else, maybe.

I think that we need 0001 if we choose specifically (1) and (A). My gut feeling is that the least-invasive way to do this project is to choose (1) and (D). My second choice would be (1) and (C), and my third choice would be (1) and (A). If I can't have (1), I think I prefer (2) over (3), but I also believe I prefer hiding in a deep hole to either of them. Maybe I'm not seeing the whole picture correctly here, but both (2) and (3) look awfully painful to me. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Aug 9, 2023 at 1:05 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Aug 8, 2023 at 10:32 AM Amit Langote <amitlangote09@gmail.com> wrote: > > But should ExecInitNode() subroutines return the partially initialized > > PlanState node or NULL on detecting invalidation? If I'm > > understanding how you think this should be working correctly, I think > > you mean the former, because if it were the latter, ExecInitNode() > > would end up returning NULL at the top for the root and then there's > > nothing to pass to ExecEndNode(), so no way to clean up to begin with. > > In that case, I think we will need to adjust ExecEndNode() subroutines > > to add `if (node->ps.ps_ResultTupleSlot)` in the above code, for > > example. That's something Tom had said he doesn't like very much [1]. > > Yeah, I understood Tom's goal as being "don't return partially > initialized nodes." > > Personally, I'm not sure that's an important goal. In fact, I don't > even think it's a desirable one. It doesn't look difficult to audit > the end-node functions for cases where they'd fail if a particular > pointer were NULL instead of pointing to some real data, and just > fixing all such cases to have NULL-tests looks like purely mechanical > work that we are unlikely to get wrong. And at least some cases > wouldn't require any changes at all. > > If we don't do that, the complexity doesn't go away. It just moves > someplace else. Presumably what we do in that case is have > ExecInitNode functions undo any initialization that they've already > done before returning NULL. There are basically two ways to do that. > Option one is to add code at the point where they return early to > clean up anything they've already initialized, but that code is likely > to substantially duplicate whatever the ExecEndNode function already > knows how to do, and it's very easy for logic like this to get broken > if somebody rearranges an ExecInitNode function down the road. Yeah, I too am not a fan of making ExecInitNode() clean up partially initialized nodes. > Option > two is to rearrange the ExecInitNode functions now, to open relations > or recurse at the beginning, so that we discover the need to fail > before we initialize anything. That restricts our ability to further > rearrange the functions in future somewhat, but more importantly, > IMHO, it introduces more risk right now. Checking that the ExecEndNode > function will not fail if some pointers are randomly null is a lot > easier than checking that changing the order of operations in an > ExecInitNode function breaks nothing. > > I'm not here to say that we can't do one of those things. But I think > adding null-tests to ExecEndNode functions looks like *far* less work > and *way* less risk. +1 > There's a second issue here, too, which is when we abort ExecInitNode > partway through, how do we signal that? You're rightly pointing out > here that if we do that by returning NULL, then we don't do it by > returning a pointer to the partially initialized node that we just > created, which means that we either need to store those partially > initialized nodes in a separate data structure as you propose to do in > 0001, > > or else we need to pick a different signalling convention. 
We > could change (a) ExecInitNode to have an additional argument, bool > *kaboom, or (b) we could make it return bool and return the node > pointer via a new additional argument, or (c) we could put a Boolean > flag into the estate and let the function signal failure by flipping > the value of the flag. The failure can already be detected by seeing that ExecPlanIsValid(estate) is false. The question is what ExecInitNode() or any of its subroutines should return once it is. I think the following convention works: Return partially initialized state from ExecInit* function where we detect the invalidation after calling ExecInitNode() on a child plan, so that ExecEndNode() can recurse to clean it up. Return NULL from ExecInit* functions where we detect the invalidation after opening and locking a relation but before calling ExecInitNode() to initialize a child plan if there's one at all. Even if we may set things like ExprContext, TupleTableSlot fields, they are cleaned up independently of the plan tree anyway via the cleanup called with es_exprcontexts, es_tupleTable, respectively. I even noticed bits like this in ExecEnd* functions: - /* - * Free the exprcontext(s) ... now dead code, see ExecFreeExprContext - */ -#ifdef NOT_USED - ExecFreeExprContext(&node->ss.ps); - if (node->ioss_RuntimeContext) - FreeExprContext(node->ioss_RuntimeContext, true); -#endif So, AFAICS, ExprContext, TupleTableSlot cleanup in ExecNode* functions is unnecessary but remain around because nobody cared about and got around to getting rid of it. > If we do any of those things, then as far as I > can see 0001 is unnecessary. If we do none of them but also avoid > creating partially initialized nodes by one of the two techniques > mentioned two paragraphs prior, then 0001 is also unnecessary. If we > do none of them but do create partially initialized nodes, then we > need 0001. > > So if this were a restaurant menu, then it might look like this: > > Prix Fixe Menu (choose one from each) > > First Course - How do we clean up after partial initialization? > (1) ExecInitNode functions produce partially initialized nodes > (2) ExecInitNode functions get refactored so that the stuff that can > cause early exit always happens first, so that no cleanup is ever > needed > (3) ExecInitNode functions do any required cleanup in situ > > Second Course - How do we signal that initialization stopped early? > (A) Return NULL. > (B) Add a bool * out-parmeter to ExecInitNode. > (C) Add a Node * out-parameter to ExecInitNode and change the return > value to bool. > (D) Add a bool to the EState. > (E) Something else, maybe. > > I think that we need 0001 if we choose specifically (1) and (A). My > gut feeling is that the least-invasive way to do this project is to > choose (1) and (D). My second choice would be (1) and (C), and my > third choice would be (1) and (A). If I can't have (1), I think I > prefer (2) over (3), but I also believe I prefer hiding in a deep hole > to either of them. Maybe I'm not seeing the whole picture correctly > here, but both (2) and (3) look awfully painful to me. I think what I've ended up with in the attached 0001 (WIP) is both (1), (2), and (D). As mentioned above, (D) is implemented with the ExecPlanStillValid() function. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
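A minimal sketch of that convention for a node type with one child might look as follows; everything except ExecPlanStillValid() is a placeholder, and the real patch will differ in detail:

FooState *
ExecInitFoo(Foo *node, EState *estate, int eflags)
{
    FooState   *foostate = makeNode(FooState);

    foostate->ps.plan = (Plan *) node;
    foostate->ps.state = estate;

    /* initializing the child may lock partitions and detect invalidation */
    outerPlanState(foostate) = ExecInitNode(outerPlan(node), estate, eflags);
    if (!ExecPlanStillValid(estate))
        return foostate;        /* partially initialized; cleanup copes */

    /* ... node-local initialization that is safe to skip on early exit ... */

    return foostate;
}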
Attachment
- v45-0006-Track-opened-range-table-relations-in-a-List-in-.patch
- v45-0003-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v45-0005-Delay-locking-of-child-tables-in-cached-plans-un.patch
- v45-0004-Set-inFromCl-to-false-in-child-table-RTEs.patch
- v45-0002-Refactoring-to-move-ExecutorStart-calls-to-be-ne.patch
- v45-0001-Add-support-for-allowing-ExecInitNode-to-detect-.patch
On Fri, Aug 11, 2023 at 14:31 Amit Langote <amitlangote09@gmail.com> wrote:
On Wed, Aug 9, 2023 at 1:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 8, 2023 at 10:32 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > But should ExecInitNode() subroutines return the partially initialized
> > PlanState node or NULL on detecting invalidation? If I'm
> > understanding how you think this should be working correctly, I think
> > you mean the former, because if it were the latter, ExecInitNode()
> > would end up returning NULL at the top for the root and then there's
> > nothing to pass to ExecEndNode(), so no way to clean up to begin with.
> > In that case, I think we will need to adjust ExecEndNode() subroutines
> > to add `if (node->ps.ps_ResultTupleSlot)` in the above code, for
> > example. That's something Tom had said he doesn't like very much [1].
>
> Yeah, I understood Tom's goal as being "don't return partially
> initialized nodes."
>
> Personally, I'm not sure that's an important goal. In fact, I don't
> even think it's a desirable one. It doesn't look difficult to audit
> the end-node functions for cases where they'd fail if a particular
> pointer were NULL instead of pointing to some real data, and just
> fixing all such cases to have NULL-tests looks like purely mechanical
> work that we are unlikely to get wrong. And at least some cases
> wouldn't require any changes at all.
>
> If we don't do that, the complexity doesn't go away. It just moves
> someplace else. Presumably what we do in that case is have
> ExecInitNode functions undo any initialization that they've already
> done before returning NULL. There are basically two ways to do that.
> Option one is to add code at the point where they return early to
> clean up anything they've already initialized, but that code is likely
> to substantially duplicate whatever the ExecEndNode function already
> knows how to do, and it's very easy for logic like this to get broken
> if somebody rearranges an ExecInitNode function down the road.
Yeah, I too am not a fan of making ExecInitNode() clean up partially
initialized nodes.
> Option
> two is to rearrange the ExecInitNode functions now, to open relations
> or recurse at the beginning, so that we discover the need to fail
> before we initialize anything. That restricts our ability to further
> rearrange the functions in future somewhat, but more importantly,
> IMHO, it introduces more risk right now. Checking that the ExecEndNode
> function will not fail if some pointers are randomly null is a lot
> easier than checking that changing the order of operations in an
> ExecInitNode function breaks nothing.
>
> I'm not here to say that we can't do one of those things. But I think
> adding null-tests to ExecEndNode functions looks like *far* less work
> and *way* less risk.
+1
> There's a second issue here, too, which is when we abort ExecInitNode
> partway through, how do we signal that? You're rightly pointing out
> here that if we do that by returning NULL, then we don't do it by
> returning a pointer to the partially initialized node that we just
> created, which means that we either need to store those partially
> initialized nodes in a separate data structure as you propose to do in
> 0001,
>
> or else we need to pick a different signalling convention. We
> could change (a) ExecInitNode to have an additional argument, bool
> *kaboom, or (b) we could make it return bool and return the node
> pointer via a new additional argument, or (c) we could put a Boolean
> flag into the estate and let the function signal failure by flipping
> the value of the flag.
The failure can already be detected by seeing that
ExecPlanIsValid(estate) is false. The question is what ExecInitNode()
or any of its subroutines should return once it is. I think the
following convention works:
Return partially initialized state from ExecInit* function where we
detect the invalidation after calling ExecInitNode() on a child plan,
so that ExecEndNode() can recurse to clean it up.
Return NULL from ExecInit* functions where we detect the invalidation
after opening and locking a relation but before calling ExecInitNode()
to initialize a child plan if there's one at all. Even if we may set
things like ExprContext, TupleTableSlot fields, they are cleaned up
independently of the plan tree anyway via the cleanup called with
es_exprcontexts, es_tupleTable, respectively. I even noticed bits
like this in ExecEnd* functions:
- /*
- * Free the exprcontext(s) ... now dead code, see ExecFreeExprContext
- */
-#ifdef NOT_USED
- ExecFreeExprContext(&node->ss.ps);
- if (node->ioss_RuntimeContext)
- FreeExprContext(node->ioss_RuntimeContext, true);
-#endif
So, AFAICS, ExprContext, TupleTableSlot cleanup in ExecNode* functions
is unnecessary but remain around because nobody cared about and got
around to getting rid of it.
> If we do any of those things, then as far as I
> can see 0001 is unnecessary. If we do none of them but also avoid
> creating partially initialized nodes by one of the two techniques
> mentioned two paragraphs prior, then 0001 is also unnecessary. If we
> do none of them but do create partially initialized nodes, then we
> need 0001.
>
> So if this were a restaurant menu, then it might look like this:
>
> Prix Fixe Menu (choose one from each)
>
> First Course - How do we clean up after partial initialization?
> (1) ExecInitNode functions produce partially initialized nodes
> (2) ExecInitNode functions get refactored so that the stuff that can
> cause early exit always happens first, so that no cleanup is ever
> needed
> (3) ExecInitNode functions do any required cleanup in situ
>
> Second Course - How do we signal that initialization stopped early?
> (A) Return NULL.
> (B) Add a bool * out-parameter to ExecInitNode.
> (C) Add a Node * out-parameter to ExecInitNode and change the return
> value to bool.
> (D) Add a bool to the EState.
> (E) Something else, maybe.
>
> I think that we need 0001 if we choose specifically (1) and (A). My
> gut feeling is that the least-invasive way to do this project is to
> choose (1) and (D). My second choice would be (1) and (C), and my
> third choice would be (1) and (A). If I can't have (1), I think I
> prefer (2) over (3), but I also believe I prefer hiding in a deep hole
> to either of them. Maybe I'm not seeing the whole picture correctly
> here, but both (2) and (3) look awfully painful to me.
I think what I've ended up with in the attached 0001 (WIP) is both
(1), (2), and (D). As mentioned above, (D) is implemented with the
ExecPlanStillValid() function.
After removing the unnecessary cleanup code from most node types’ ExecEnd* functions, one thing I’m tempted to do is remove the functions that do nothing else but recurse to close the outerPlan, innerPlan child nodes. We could instead have ExecEndNode() itself recurse to close outerPlan, innerPlan child nodes at the top, which preserves the close-child-before-self behavior for Gather* nodes, and call node type specific cleanup functions only for nodes that do have any local cleanup to do. Perhaps we could even use planstate_tree_walker() called at the top instead of the usual bottom so that nodes with a list of child subplans like Append also don’t need to have their own ExecEnd* functions.
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
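A sketch of that planstate_tree_walker()-at-the-top arrangement, purely for illustration (the case list is incomplete and the actual patch may differ):

static bool ExecEndNodeWalker(PlanState *node, void *context);

void
ExecEndNode(PlanState *node)
{
    if (node == NULL)
        return;

    /* close all children first (covers Append's subplan list as well) */
    planstate_tree_walker(node, ExecEndNodeWalker, NULL);

    switch (nodeTag(node))
    {
        case T_GatherState:
            ExecEndGather((GatherState *) node);    /* no longer recurses itself */
            break;

            /* ... other node types that still need local cleanup ... */

        default:
            /* Nothing to do for node types without local cleanup */
            break;
    }
}

static bool
ExecEndNodeWalker(PlanState *node, void *context)
{
    ExecEndNode(node);
    return false;
}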
On Fri, Aug 11, 2023 at 9:50 AM Amit Langote <amitlangote09@gmail.com> wrote: > After removing the unnecessary cleanup code from most node types’ ExecEnd* functions, one thing I’m tempted to do is remove the functions that do nothing else but recurse to close the outerPlan, innerPlan child nodes. We could instead have ExecEndNode() itself recurse to close outerPlan, innerPlan child nodes at the top, which preserves the close-child-before-self behavior for Gather* nodes, and call node type specific cleanup functions only for nodes that do have any local cleanup to do. Perhaps we could even use planstate_tree_walker() called at the top instead of the usual bottom so that nodes with a list of child subplans like Append also don’t need to have their own ExecEnd* functions.

I think 0001 needs to be split up. Like, this is code cleanup:

- /*
-  * Free the exprcontext
-  */
- ExecFreeExprContext(&node->ss.ps);

This is providing for NULL pointers where we don't currently:

- list_free_deep(aggstate->hash_batches);
+ if (aggstate->hash_batches)
+     list_free_deep(aggstate->hash_batches);

And this is the early return mechanism per se:

+ if (!ExecPlanStillValid(estate))
+     return aggstate;

I think at least those 3 kinds of changes deserve to be in separate patches with separate commit messages explaining the rationale behind each, e.g. "Remove unnecessary cleanup calls in ExecEnd* functions. These calls are no longer required, because <reasons>. Removing them saves a few CPU cycles and simplifies planned refactoring, so do that." -- Robert Haas EDB: http://www.enterprisedb.com
Thanks for taking a look. On Mon, Aug 28, 2023 at 10:43 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Aug 11, 2023 at 9:50 AM Amit Langote <amitlangote09@gmail.com> wrote: > > After removing the unnecessary cleanup code from most node types’ ExecEnd* functions, one thing I’m tempted to do is remove the functions that do nothing else but recurse to close the outerPlan, innerPlan child nodes. We could instead have ExecEndNode() itself recurse to close outerPlan, innerPlan child nodes at the top, which preserves the close-child-before-self behavior for Gather* nodes, and call node type specific cleanup functions only for nodes that do have any local cleanup to do. Perhaps we could even use planstate_tree_walker() called at the top instead of the usual bottom so that nodes with a list of child subplans like Append also don’t need to have their own ExecEnd* functions. > > I think 0001 needs to be split up. Like, this is code cleanup: > > - /* > - * Free the exprcontext > - */ > - ExecFreeExprContext(&node->ss.ps); > > This is providing for NULL pointers where we don't currently: > > - list_free_deep(aggstate->hash_batches); > + if (aggstate->hash_batches) > + list_free_deep(aggstate->hash_batches); > > And this is the early return mechanism per se: > > + if (!ExecPlanStillValid(estate)) > + return aggstate; > > I think at least those 3 kinds of changes deserve to be in separate > patches with separate commit messages explaining the rationale behind > each e.g. "Remove unnecessary cleanup calls in ExecEnd* functions. > These calls are no longer required, because <reasons>. Removing them > saves a few CPU cycles and simplifies planned refactoring, so do > that." Breaking up the patch as you describe makes sense, so I've done that: Attached 0001 removes unnecessary cleanup calls from ExecEnd*() routines. 0002 adds NULLness checks in ExecEnd*() routines on some pointers that may not be initialized by the corresponding ExecInit*() routines in the case where they return early. 0003 adds the early return mechanism based on checking CachedPlan invalidation, though no CachedPlan is actually passed to the executor yet, so no functional changes here yet. Other patches are rebased over these. One significant change is in 0004 which does the refactoring to make the callers of ExecutorStart() aware that it may now return with a partially initialized planstate tree that should not be executed. I added a new flag EState.es_canceled to denote that state of the execution to complement the existing es_finished. I also needed to add AfterTriggerCancelQuery() to ensure that we don't attempt to fire a canceled query's triggers. Most of these changes are needed only to appease the various Asserts in these parts of the code and I thought they are warranted given the introduction of a new state of query execution. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
- v46-0004-Make-ExecutorStart-return-early-upon-plan-invali.patch
- v46-0006-Set-inFromCl-to-false-in-child-table-RTEs.patch
- v46-0005-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v46-0008-Track-opened-range-table-relations-in-a-List-in-.patch
- v46-0007-Delay-locking-of-child-tables-in-cached-plans-un.patch
- v46-0003-Support-for-ExecInitNode-to-detect-CachedPlan-in.patch
- v46-0001-Refactor-ExecEnd-routines-to-enhance-efficiency.patch
- v46-0002-Check-pointer-NULLness-before-cleanup-in-ExecEnd.patch
On Tue, Sep 5, 2023 at 3:13 AM Amit Langote <amitlangote09@gmail.com> wrote: > Attached 0001 removes unnecessary cleanup calls from ExecEnd*() routines. It also adds a few random Assert()s to verify that unrelated pointers are not NULL. I suggest that it shouldn't do that. The commit message doesn't mention the removal of the calls to ExecDropSingleTupleTableSlot. It's not clear to me why that's OK and I think it would be nice to mention it in the commit message, assuming that it is in fact OK. I suggest changing the subject line of the commit to something like "Remove obsolete executor cleanup code." > 0002 adds NULLness checks in ExecEnd*() routines on some pointers that > may not be initialized by the corresponding ExecInit*() routines in > the case where it returns early. I think you should only add these where it's needed. For example, I think list_free_deep(NIL) is fine. The changes to ExecEndForeignScan look like they include stuff that belongs in 0001. Personally, I prefer explicit NULL-tests i.e. if (x != NULL) to implicit ones like if (x), but opinions vary. -- Robert Haas EDB: http://www.enterprisedb.com
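As a concrete instance of "only where it's needed": list_free_deep() already treats an empty list as a no-op, whereas MemoryContextDelete() must not be passed NULL, so only the latter needs a guard (using the fields already discussed in this thread):

/* NIL-safe: list_free_deep() simply returns if the list is empty */
list_free_deep(aggstate->hash_batches);

/* not NULL-safe: guard the context before deleting it */
if (node->tableContext != NULL)
    MemoryContextDelete(node->tableContext);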
On Tue, Sep 5, 2023 at 11:41 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Sep 5, 2023 at 3:13 AM Amit Langote <amitlangote09@gmail.com> wrote: > > Attached 0001 removes unnecessary cleanup calls from ExecEnd*() routines. > > It also adds a few random Assert()s to verify that unrelated pointers > are not NULL. I suggest that it shouldn't do that. OK, removed. > The commit message doesn't mention the removal of the calls to > ExecDropSingleTupleTableSlot. It's not clear to me why that's OK and I > think it would be nice to mention it in the commit message, assuming > that it is in fact OK. That is not OK, so I dropped their removal. I think I confused them with slots in other functions initialized with ExecInitExtraTupleSlot() that *are* put into the estate. > I suggest changing the subject line of the commit to something like > "Remove obsolete executor cleanup code." Sure. > > 0002 adds NULLness checks in ExecEnd*() routines on some pointers that > > may not be initialized by the corresponding ExecInit*() routines in > > the case where it returns early. > > I think you should only add these where it's needed. For example, I > think list_free_deep(NIL) is fine. OK, done. > The changes to ExecEndForeignScan look like they include stuff that > belongs in 0001. Oops, yes. Moved to 0001. > Personally, I prefer explicit NULL-tests i.e. if (x != NULL) to > implicit ones like if (x), but opinions vary. I agree, so changed all the new tests to use (x != NULL) form. Typically, I try to stick with whatever style is used in the nearby code, though I can see both styles being used in the ExecEnd*() routines. I opted to use the style that we both happen to prefer. Attached updated patches. Thanks for the review. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
- v47-0005-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v47-0006-Set-inFromCl-to-false-in-child-table-RTEs.patch
- v47-0008-Track-opened-range-table-relations-in-a-List-in-.patch
- v47-0007-Delay-locking-of-child-tables-in-cached-plans-un.patch
- v47-0004-Adjustments-to-allow-ExecutorStart-to-sometimes-.patch
- v47-0003-Support-for-ExecInitNode-to-detect-CachedPlan-in.patch
- v47-0002-Check-pointer-NULLness-before-cleanup-in-ExecEnd.patch
- v47-0001-Remove-obsolete-executor-cleanup-code.patch
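On the ExecDropSingleTupleTableSlot() point above, the distinction is between slots registered in the EState's tuple table and standalone slots. A minimal illustration follows, using the stock executor API; estate and tupdesc are assumed to come from the surrounding ExecInit* routine.

    TupleTableSlot *slot_in_estate;
    TupleTableSlot *standalone_slot;

    /*
     * Registered in estate->es_tupleTable, so ExecResetTupleTable() frees
     * it during ExecEndPlan(); an ExecEnd* routine need not drop it.
     */
    slot_in_estate = ExecInitExtraTupleSlot(estate, tupdesc, &TTSOpsVirtual);

    /*
     * Not tracked by the EState at all, so whoever created it must drop it
     * explicitly, typically in the node's ExecEnd* routine.
     */
    standalone_slot = MakeSingleTupleTableSlot(tupdesc, &TTSOpsVirtual);
    /* ... use the slot ... */
    ExecDropSingleTupleTableSlot(standalone_slot);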
On Wed, Sep 6, 2023 at 5:12 AM Amit Langote <amitlangote09@gmail.com> wrote: > Attached updated patches. Thanks for the review. I think 0001 looks ready to commit. I'm not sure that the commit message needs to mention future patches here, since this code cleanup seems like a good idea regardless, but if you feel otherwise, fair enough. On 0002, some questions: - In ExecEndLockRows, is the call to EvalPlanQualEnd a concern? i.e. Does that function need any adjustment? - In ExecEndMemoize, should there be a null-test around MemoryContextDelete(node->tableContext) as we have in ExecEndRecursiveUnion, ExecEndSetOp, etc.? I wonder how we feel about setting pointers to NULL after freeing the associated data structures. The existing code isn't consistent about doing that, and making it do so would be a fairly large change that would bloat this patch quite a bit. On the other hand, I think it's a good practice as a general matter, and we do do it in some ExecEnd functions. On 0003, I have some doubt about whether we really have all the right design decisions in detail here: - Why have this weird rule where sometimes we return NULL and other times the planstate? Is there any point to such a coding rule? Why not just always return the planstate? - Is there any point to all of these early exit cases? For example, in ExecInitBitmapAnd, why exit early if initialization fails? Why not just plunge ahead and if initialization failed the caller will notice that and when we ExecEndNode some of the child node pointers will be NULL but who cares? The obvious disadvantage of this approach is that we're doing a bunch of unnecessary initialization, but we're also speeding up the common case where we don't need to abort by avoiding a branch that will rarely be taken. I'm not quite sure what the right thing to do is here. - The cases where we call ExecGetRangeTableRelation or ExecOpenScanRelation are a bit subtler ... maybe initialization that we're going to do later is going to barf if the tuple descriptor of the relation isn't what we thought it was going to be. In that case it becomes important to exit early. But if that's not actually a problem, then we could apply the same principle here also -- don't pollute the code with early-exit cases, just let it do its thing and sort it out later. Do you know what the actual problems would be here if we didn't exit early in these cases? - Depending on the answers to the above points, one thing we could think of doing is put an early exit case into ExecInitNode itself: if (unlikely(!ExecPlanStillValid(whatever)) return NULL. Maybe Andres or someone is going to argue that that checks too often and is thus too expensive, but it would be a lot more maintainable than having similar checks strewn throughout the ExecInit* functions. Perhaps it deserves some thought/benchmarking. More generally, if there's anything we can do to centralize these checks in fewer places, I think that would be worth considering. The patch isn't terribly large as it stands, so I don't necessarily think that this is a critical issue, but I'm just wondering if we can do better. I'm not even sure that it would be too expensive to just initialize the whole plan always, and then just do one test at the end. That's not OK if the changed tuple descriptor (or something else) is going to crash or error out in a funny way or something before initialization is completed, but if it's just going to result in burning a few CPU cycles in a corner case, I don't know if we should really care. 
- The "At this point" comments don't give any rationale for why we shouldn't have received any such invalidation messages. That makes them fairly useless; the Assert by itself clarifies that you think that case shouldn't happen. The comment's job is to justify that claim. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Sep 6, 2023 at 11:20 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Sep 6, 2023 at 5:12 AM Amit Langote <amitlangote09@gmail.com> wrote: > > Attached updated patches. Thanks for the review. > > I think 0001 looks ready to commit. I'm not sure that the commit > message needs to mention future patches here, since this code cleanup > seems like a good idea regardless, but if you feel otherwise, fair > enough. OK, I will remove the mention of future patches. > On 0002, some questions: > > - In ExecEndLockRows, is the call to EvalPlanQualEnd a concern? i.e. > Does that function need any adjustment? I think it does with the patch as it stands. It needs to have an early exit at the top if parentestate is NULL, which it would be if EvalPlanQualInit() wasn't called from an ExecInit*() function. Though, as I answer below your question as to whether there is actually any need to interrupt all of the ExecInit*() routines, nothing needs to change in ExecEndLockRows(). > - In ExecEndMemoize, should there be a null-test around > MemoryContextDelete(node->tableContext) as we have in > ExecEndRecursiveUnion, ExecEndSetOp, etc.? Oops, you're right. Added. > I wonder how we feel about setting pointers to NULL after freeing the > associated data structures. The existing code isn't consistent about > doing that, and making it do so would be a fairly large change that > would bloat this patch quite a bit. On the other hand, I think it's a > good practice as a general matter, and we do do it in some ExecEnd > functions. I agree that it might be worthwhile to take the opportunity and make the code more consistent in this regard. So, I've included those changes too in 0002. > On 0003, I have some doubt about whether we really have all the right > design decisions in detail here: > > - Why have this weird rule where sometimes we return NULL and other > times the planstate? Is there any point to such a coding rule? Why not > just always return the planstate? > > - Is there any point to all of these early exit cases? For example, in > ExecInitBitmapAnd, why exit early if initialization fails? Why not > just plunge ahead and if initialization failed the caller will notice > that and when we ExecEndNode some of the child node pointers will be > NULL but who cares? The obvious disadvantage of this approach is that > we're doing a bunch of unnecessary initialization, but we're also > speeding up the common case where we don't need to abort by avoiding a > branch that will rarely be taken. I'm not quite sure what the right > thing to do is here. > > - The cases where we call ExecGetRangeTableRelation or > ExecOpenScanRelation are a bit subtler ... maybe initialization that > we're going to do later is going to barf if the tuple descriptor of > the relation isn't what we thought it was going to be. In that case it > becomes important to exit early. But if that's not actually a problem, > then we could apply the same principle here also -- don't pollute the > code with early-exit cases, just let it do its thing and sort it out > later. Do you know what the actual problems would be here if we didn't > exit early in these cases? > > - Depending on the answers to the above points, one thing we could > think of doing is put an early exit case into ExecInitNode itself: if > (unlikely(!ExecPlanStillValid(whatever)) return NULL. 
Maybe Andres or > someone is going to argue that that checks too often and is thus too > expensive, but it would be a lot more maintainable than having similar > checks strewn throughout the ExecInit* functions. Perhaps it deserves > some thought/benchmarking. More generally, if there's anything we can > do to centralize these checks in fewer places, I think that would be > worth considering. The patch isn't terribly large as it stands, so I > don't necessarily think that this is a critical issue, but I'm just > wondering if we can do better. I'm not even sure that it would be too > expensive to just initialize the whole plan always, and then just do > one test at the end. That's not OK if the changed tuple descriptor (or > something else) is going to crash or error out in a funny way or > something before initialization is completed, but if it's just going > to result in burning a few CPU cycles in a corner case, I don't know > if we should really care. I thought about this some and figured that adding the is-CachedPlan-still-valid tests in the following places should suffice after all: 1. In InitPlan() right after the top-level ExecInitNode() calls 2. In ExecInit*() functions of Scan nodes, right after ExecOpenScanRelation() calls CachedPlans can only become invalid because of concurrent changes to the inheritance child tables referenced in the plan. Only the following schema modifications of child tables are possible to be performed concurrently: * Addition of a column (allowed only if traditional inheritance child) * Addition of an index * Addition of a non-index constraint * Dropping of a child table (allowed only if traditional inheritance child) * Dropping of an index referenced in the plan The first 3 are not destructive enough to cause crashes, weird errors during ExecInit*(), though the last two can be, so the 2nd set of the tests after ExecOpenScanRelation() mentioned above. > - The "At this point" comments don't give any rationale for why we > shouldn't have received any such invalidation messages. That makes > them fairly useless; the Assert by itself clarifies that you think > that case shouldn't happen. The comment's job is to justify that > claim. I've rewritten the comments. I'll post the updated set of patches shortly. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
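To illustrate the second set of checks described above, here is roughly where they would sit in a scan node's ExecInit routine, using ExecInitSeqScan()-style code (abridged sketch; ExecPlanStillValid() is the helper from the patch set):

    /* open the scan relation, which may take a lock not yet held */
    scanstate->ss.ss_currentRelation =
        ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);

    /*
     * Taking that lock may have delivered an invalidation message that
     * marks the CachedPlan stale, for example because an index referenced
     * by the plan was concurrently dropped.  Bail out before building
     * tuple slots and expression state that could depend on a now-outdated
     * tuple descriptor.
     */
    if (!ExecPlanStillValid(estate))
        return scanstate;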
On Mon, Sep 25, 2023 at 9:57 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Wed, Sep 6, 2023 at 11:20 PM Robert Haas <robertmhaas@gmail.com> wrote: > > - Is there any point to all of these early exit cases? For example, in > > ExecInitBitmapAnd, why exit early if initialization fails? Why not > > just plunge ahead and if initialization failed the caller will notice > > that and when we ExecEndNode some of the child node pointers will be > > NULL but who cares? The obvious disadvantage of this approach is that > > we're doing a bunch of unnecessary initialization, but we're also > > speeding up the common case where we don't need to abort by avoiding a > > branch that will rarely be taken. I'm not quite sure what the right > > thing to do is here. > I thought about this some and figured that adding the > is-CachedPlan-still-valid tests in the following places should suffice > after all: > > 1. In InitPlan() right after the top-level ExecInitNode() calls > 2. In ExecInit*() functions of Scan nodes, right after > ExecOpenScanRelation() calls After sleeping on this, I think we do need the checks after all the ExecInitNode() calls too, because we have many instances of the code like the following one: outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags); tupDesc = ExecGetResultType(outerPlanState(gatherstate)); <some code that dereferences outDesc> If outerNode is a SeqScan and ExecInitSeqScan() returned early because ExecOpenScanRelation() detected that plan was invalidated, then tupDesc would be NULL in this case, causing the code to crash. Now one might say that perhaps we should only add the is-CachedPlan-valid test in the instances where there is an actual risk of such misbehavior, but that could lead to confusion, now or later. It seems better to add them after every ExecInitNode() call while we're inventing the notion, because doing so relieves the authors of future enhancements of the ExecInit*() routines from worrying about any of this. Attached 0003 should show how that turned out. Updated 0002 as mentioned in the previous reply -- setting pointers to NULL after freeing them more consistently across various ExecEnd*() routines and using the `if (pointer != NULL)` style over the `if (pointer)` more consistently. Updated 0001's commit message to remove the mention of its relation to any future commits. I intend to push it tomorrow. Patches 0004 onwards contain changes too, mainly in terms of moving the code around from one patch to another, but I'll omit the details of the specific change for now. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
- v47-0005-Teach-the-executor-to-lock-child-tables-in-some-.patch
- v47-0007-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v47-0009-Track-opened-range-table-relations-in-a-List-in-.patch
- v47-0006-Assert-that-relations-needing-their-permissions-.patch
- v47-0008-Delay-locking-of-child-tables-in-cached-plans-un.patch
- v47-0002-Check-pointer-NULLness-before-cleanup-in-ExecEnd.patch
- v47-0004-Adjustments-to-allow-ExecutorStart-to-sometimes-.patch
- v47-0003-Prepare-executor-to-support-detecting-CachedPlan.patch
- v47-0001-Remove-obsolete-executor-cleanup-code.patch
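The hazard described in the Gather example above, plus the guard that 0003 adds after the ExecInitNode() call, looks like this in abridged form (again, ExecPlanStillValid() comes from the patch set):

    /* initialize the child, which may find that the CachedPlan is stale */
    outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);

    /*
     * If the child returned early, its result descriptor was never set up,
     * so ExecGetResultType() would hand back NULL and the code below would
     * crash dereferencing it.  Return the partially initialized node
     * instead; ExecEndNode() knows how to clean it up.
     */
    if (!ExecPlanStillValid(estate))
        return gatherstate;

    tupDesc = ExecGetResultType(outerPlanState(gatherstate));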
On Tue, Sep 26, 2023 at 10:06 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Mon, Sep 25, 2023 at 9:57 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Wed, Sep 6, 2023 at 11:20 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > - Is there any point to all of these early exit cases? For example, in > > > ExecInitBitmapAnd, why exit early if initialization fails? Why not > > > just plunge ahead and if initialization failed the caller will notice > > > that and when we ExecEndNode some of the child node pointers will be > > > NULL but who cares? The obvious disadvantage of this approach is that > > > we're doing a bunch of unnecessary initialization, but we're also > > > speeding up the common case where we don't need to abort by avoiding a > > > branch that will rarely be taken. I'm not quite sure what the right > > > thing to do is here. > > I thought about this some and figured that adding the > > is-CachedPlan-still-valid tests in the following places should suffice > > after all: > > > > 1. In InitPlan() right after the top-level ExecInitNode() calls > > 2. In ExecInit*() functions of Scan nodes, right after > > ExecOpenScanRelation() calls > > After sleeping on this, I think we do need the checks after all the > ExecInitNode() calls too, because we have many instances of the code > like the following one: > > outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags); > tupDesc = ExecGetResultType(outerPlanState(gatherstate)); > <some code that dereferences outDesc> > > If outerNode is a SeqScan and ExecInitSeqScan() returned early because > ExecOpenScanRelation() detected that plan was invalidated, then > tupDesc would be NULL in this case, causing the code to crash. > > Now one might say that perhaps we should only add the > is-CachedPlan-valid test in the instances where there is an actual > risk of such misbehavior, but that could lead to confusion, now or > later. It seems better to add them after every ExecInitNode() call > while we're inventing the notion, because doing so relieves the > authors of future enhancements of the ExecInit*() routines from > worrying about any of this. > > Attached 0003 should show how that turned out. > > Updated 0002 as mentioned in the previous reply -- setting pointers to > NULL after freeing them more consistently across various ExecEnd*() > routines and using the `if (pointer != NULL)` style over the `if > (pointer)` more consistently. > > Updated 0001's commit message to remove the mention of its relation to > any future commits. I intend to push it tomorrow. Pushed that one. Here are the rebased patches. 0001 seems ready to me, but I'll wait a couple more days for others to weigh in. Just to highlight a kind of change that others may have differing opinions on, consider this hunk from the patch: - MemoryContextDelete(node->aggcontext); + if (node->aggcontext != NULL) + { + MemoryContextDelete(node->aggcontext); + node->aggcontext = NULL; + } ... + ExecEndNode(outerPlanState(node)); + outerPlanState(node) = NULL; So the patch wants to enhance the consistency of setting the pointer to NULL after freeing part. Robert mentioned his preference for doing it in the patch, which I agree with. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
- v48-0007-Delay-locking-of-child-tables-in-cached-plans-un.patch
- v48-0006-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v48-0008-Track-opened-range-table-relations-in-a-List-in-.patch
- v48-0005-Assert-that-relations-needing-their-permissions-.patch
- v48-0004-Teach-the-executor-to-lock-child-tables-in-some-.patch
- v48-0003-Adjustments-to-allow-ExecutorStart-to-sometimes-.patch
- v48-0001-Assorted-tightening-in-various-ExecEnd-routines.patch
- v48-0002-Prepare-executor-to-support-detecting-CachedPlan.patch
On Thu, Sep 28, 2023 at 5:26 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Tue, Sep 26, 2023 at 10:06 PM Amit Langote <amitlangote09@gmail.com> wrote: > > After sleeping on this, I think we do need the checks after all the > > ExecInitNode() calls too, because we have many instances of the code > > like the following one: > > > > outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags); > > tupDesc = ExecGetResultType(outerPlanState(gatherstate)); > > <some code that dereferences outDesc> > > > > If outerNode is a SeqScan and ExecInitSeqScan() returned early because > > ExecOpenScanRelation() detected that plan was invalidated, then > > tupDesc would be NULL in this case, causing the code to crash. > > > > Now one might say that perhaps we should only add the > > is-CachedPlan-valid test in the instances where there is an actual > > risk of such misbehavior, but that could lead to confusion, now or > > later. It seems better to add them after every ExecInitNode() call > > while we're inventing the notion, because doing so relieves the > > authors of future enhancements of the ExecInit*() routines from > > worrying about any of this. > > > > Attached 0003 should show how that turned out. > > > > Updated 0002 as mentioned in the previous reply -- setting pointers to > > NULL after freeing them more consistently across various ExecEnd*() > > routines and using the `if (pointer != NULL)` style over the `if > > (pointer)` more consistently. > > > > Updated 0001's commit message to remove the mention of its relation to > > any future commits. I intend to push it tomorrow. > > Pushed that one. Here are the rebased patches. > > 0001 seems ready to me, but I'll wait a couple more days for others to > weigh in. Just to highlight a kind of change that others may have > differing opinions on, consider this hunk from the patch: > > - MemoryContextDelete(node->aggcontext); > + if (node->aggcontext != NULL) > + { > + MemoryContextDelete(node->aggcontext); > + node->aggcontext = NULL; > + } > ... > + ExecEndNode(outerPlanState(node)); > + outerPlanState(node) = NULL; > > So the patch wants to enhance the consistency of setting the pointer > to NULL after freeing part. Robert mentioned his preference for doing > it in the patch, which I agree with. Rebased. I haven't been able to reproduce and debug a crash reported by cfbot that I see every now and then: https://cirrus-ci.com/task/5673432591892480?logs=cores#L0 [22:46:12.328] Program terminated with signal SIGSEGV, Segmentation fault. [22:46:12.328] Address not mapped to object. [22:46:12.838] #0 afterTriggerInvokeEvents (events=events@entry=0x836db0460, firing_id=1, estate=estate@entry=0x842eec100, delete_ok=<optimized out>) at ../src/backend/commands/trigger.c:4656 [22:46:12.838] #1 0x00000000006c67a8 in AfterTriggerEndQuery (estate=estate@entry=0x842eec100) at ../src/backend/commands/trigger.c:5085 [22:46:12.838] #2 0x000000000065bfba in CopyFrom (cstate=0x836df9038) at ../src/backend/commands/copyfrom.c:1293 ... While a patch in this series does change src/backend/commands/trigger.c, I'm not yet sure about its relation with the backtrace shown there. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Attachment
- v49-0006-Add-field-to-store-parent-relids-to-Append-Merge.patch
- v49-0007-Delay-locking-of-child-tables-in-cached-plans-un.patch
- v49-0005-Assert-that-relations-needing-their-permissions-.patch
- v49-0004-Teach-the-executor-to-lock-child-tables-in-some-.patch
- v49-0008-Track-opened-range-table-relations-in-a-List-in-.patch
- v49-0002-Prepare-executor-to-support-detecting-CachedPlan.patch
- v49-0001-Assorted-tightening-in-various-ExecEnd-routines.patch
- v49-0003-Adjustments-to-allow-ExecutorStart-to-sometimes-.patch
Reviewing 0001: Perhaps ExecEndCteScan needs an adjustment. What if node->leader was never set? Other than that, I think this is in good shape. Maybe there are other things we'd want to adjust here, or maybe there aren't, but there doesn't seem to be any good reason to bundle more changes into the same patch. Reviewing 0002 and beyond: I think it's good that you have tried to divide up a big change into little pieces, but I'm finding the result difficult to understand. It doesn't really seem like each patch stands on its own. I keep flipping between patches to try to understand why other patches are doing things, which kind of defeats the purpose of splitting stuff up. For example, 0002 adds a NodeTag field to QueryDesc, but it doesn't even seem to initialize that field, let alone use it for anything. It adds a CachedPlan pointer to QueryDesc too, and adapts CreateQueryDesc to allow one as an argument, but none of the callers actually pass anything. I suspect that the first change (adding a NodeTag field) is a bug, and that the second one is intentional, but it's hard to tell without flipping through all of the other patches to see how they build on what 0002 does. And even when something isn't a bug, it's also hard to tell whether it's the right design, again because you can't consider each patch in isolation. Ideally, splitting a patch set should bring related changes together in a single patch and push unrelated changes apart into different patches, but I don't really see this particular split having that effect. There is a chicken and egg problem here, to be fair. If we add code that can make plan initialization fail without teaching the planner to cope with failures, then we have broken the server, and if we do the reverse, then we have a bunch of dead code that we can't test. Neither is very satisfactory. But I still hope there's some better division possible than what you have here currently. For instance, I wonder if it would be possible to add all the stuff to cope with plan initialization failing and then have a test patch that makes initialization randomly fail with some probability (or maybe you can even cause failures at specific points). Then you could test that infrastructure by running the regression tests in a loop with various values of the relevant setting. Another overall comment that I have is that it doesn't feel like there's enough high-level explanation of the design. I don't know how much of that should go in comments vs. commit messages vs. a README that accompanies the patch set vs. whatever else, and I strongly suspect that some of the stuff that seems confusing now is actually stuff that at one point I understood and have just forgotten about. But rediscovering it shouldn't be quite so hard. For example, consider the question "why are we storing the CachedPlan in the QueryDesc?" I eventually figured out that it's so that ExecPlanStillValid can call CachedPlanStillValid which can then consult the cached plan's is_valid flag. But is that the only access to the CachedPlan that we ever expect to occur via the QueryDesc? If not, what else is allowable? If so, why not just store a Boolean in the QueryDesc and arrange for the plancache to be able to flip it when invalidating? I'm not saying that's a better design -- I'm saying that it looks hard to understand your thought process from the patch set. 
And also, you know, assuming the current design is correct, could there be some way of dividing up the patch set so that this one change, where we add the CachedPlan to the QueryDesc, isn't so spread out across the whole series? Some more detailed review comments below. This isn't really a full review because I don't understand the patches well enough for that, but it's some stuff I noticed. In 0002: + * result-rel info, etc. Also, we don't pass the parent't copy of the Typo. + /* + * All the necessary locks must already have been taken when + * initializing the parent's copy of subplanstate, so the CachedPlan, + * if any, should not have become invalid during ExecInitNode(). + */ + Assert(ExecPlanStillValid(rcestate)); This -- and the other similar instance -- feel very uncomfortable. There's a lot of action at a distance here. If this assertion ever failed, how would anyone ever figure out what went wrong? You wouldn't for example know which object got invalidated, presumably corresponding to a lock that you failed to take. Unless the problem were easily reproducible in a test environment, trying to guess what happened might be pretty awful; imagine seeing this assertion failure in a customer log file and trying to back-track to find the underlying bug. A further problem is that what would actually happen is you *wouldn't* see this in the customer log file, because assertions wouldn't be enabled, so you'd just see queries occasionally returning wrong answers, I guess? Or crashing in some other random part of the code? Which seems even worse. At a minimum I think this should be upgraded to a test-and-elog, and maybe there's some value in trying to think of what should get printed by that elog to facilitate proper debugging, if it happens. In 0003: + * + * OK to ignore the return value; plan can't become invalid, + * because there's no CachedPlan. */ - ExecutorStart(cstate->queryDesc, 0); + (void) ExecutorStart(cstate->queryDesc, 0); This also feels awkward, for similar reasons. Sure, it shouldn't return false, but also, if it did, you'd just blindly continue. Maybe there should be test-and-elog here too. Or maybe this is an indication that we need less action at a distance. Like, if ExecutorStart took the CachedPlan as an argument instead of feeding it through the QueryDesc, then you could document that ExecutorStart returns true if that value is passed as NULL and true or false otherwise. Here, whether ExecutorStart can return true or false depends on the contents of the queryDesc ... which, granted, in this case is just built a line or two before anyway, but if you just passed it to ExecutorStart then you wouldn't need to feed it through the QueryDesc, it seems to me. Even better, maybe there should be ExecutorStart() that continues returning void and ExecutorStartExtended() that takes a cached plan as an additional argument and returns a bool. /* - * Check that ExecutorFinish was called, unless in EXPLAIN-only mode. This - * Assert is needed because ExecutorFinish is new as of 9.1, and callers - * might forget to call it. + * Check that ExecutorFinish was called, unless in EXPLAIN-only mode or if + * execution was canceled. This Assert is needed because ExecutorFinish is + * new as of 9.1, and callers might forget to call it. */ Maybe we could drop the second sentence at this point. In 0005: + * XXX Maybe we should we skip calling ExecCheckPermissions from + * InitPlan in a parallel worker. Why? If the thinking is to save overhead, then perhaps try to assess the overhead. 
If the thinking is that we don't want it to fail spuriously, then we have to weigh that against the (security) risk of succeeding spuriously. + * Returns true if current transaction holds a lock on the given relation of + * mode 'lockmode'. If 'orstronger' is true, a stronger lockmode is also OK. + * ("Stronger" is defined as "numerically higher", which is a bit + * semantically dubious but is OK for the purposes we use this for.) I don't particularly enjoy seeing this comment cut and pasted into some new place. Especially the tongue-in-cheek parenthetical part. Better to refer to the original comment or something instead of cut-and-pasting. Also, why is it appropriate to pass orstronger = true here? Don't we expect the *exact* lock mode that we have planned to be held, and isn't it a sure sign of a bug if it isn't? Maybe orstronger should just be ripped out here (and the comment could then go away too). In 0006: + /* + * RTIs of all partitioned tables whose children are scanned by + * appendplans. The list contains a bitmapset for every partition tree + * covered by this Append. + */ The first sentence of this comment makes this sound like a list of integers, the RTIs of all partitioned tables that are scanned. The second sentence makes it sound like a list of bitmapsets, but what does it mean to talk about each partition tree covered by this Append? This is far from a complete review but I'm running out of steam for today. I hope that it's at least somewhat useful. ...Robert
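One possible reading of the ExecutorStart()/ExecutorStartExtended() suggestion above, spelled out as code. This is entirely hypothetical; no such function exists in core and the signature is only a guess, but it shows how existing callers could keep the void-returning API while plancache-aware callers use the extended form and retry with a fresh plan when it returns false.

    /* returns false iff cplan was invalidated during plan initialization */
    extern bool ExecutorStartExtended(QueryDesc *queryDesc, int eflags,
                                      CachedPlan *cplan);

    void
    ExecutorStart(QueryDesc *queryDesc, int eflags)
    {
        /*
         * With no CachedPlan in play there is nothing that can go stale,
         * so existing callers never need to handle an invalidation.
         */
        if (!ExecutorStartExtended(queryDesc, eflags, NULL))
            elog(ERROR, "plan initialization failed unexpectedly");
    }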
On Mon, 20 Nov 2023 at 10:00, Amit Langote <amitlangote09@gmail.com> wrote: > > On Thu, Sep 28, 2023 at 5:26 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Tue, Sep 26, 2023 at 10:06 PM Amit Langote <amitlangote09@gmail.com> wrote: > > > After sleeping on this, I think we do need the checks after all the > > > ExecInitNode() calls too, because we have many instances of the code > > > like the following one: > > > > > > outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags); > > > tupDesc = ExecGetResultType(outerPlanState(gatherstate)); > > > <some code that dereferences outDesc> > > > > > > If outerNode is a SeqScan and ExecInitSeqScan() returned early because > > > ExecOpenScanRelation() detected that plan was invalidated, then > > > tupDesc would be NULL in this case, causing the code to crash. > > > > > > Now one might say that perhaps we should only add the > > > is-CachedPlan-valid test in the instances where there is an actual > > > risk of such misbehavior, but that could lead to confusion, now or > > > later. It seems better to add them after every ExecInitNode() call > > > while we're inventing the notion, because doing so relieves the > > > authors of future enhancements of the ExecInit*() routines from > > > worrying about any of this. > > > > > > Attached 0003 should show how that turned out. > > > > > > Updated 0002 as mentioned in the previous reply -- setting pointers to > > > NULL after freeing them more consistently across various ExecEnd*() > > > routines and using the `if (pointer != NULL)` style over the `if > > > (pointer)` more consistently. > > > > > > Updated 0001's commit message to remove the mention of its relation to > > > any future commits. I intend to push it tomorrow. > > > > Pushed that one. Here are the rebased patches. > > > > 0001 seems ready to me, but I'll wait a couple more days for others to > > weigh in. Just to highlight a kind of change that others may have > > differing opinions on, consider this hunk from the patch: > > > > - MemoryContextDelete(node->aggcontext); > > + if (node->aggcontext != NULL) > > + { > > + MemoryContextDelete(node->aggcontext); > > + node->aggcontext = NULL; > > + } > > ... > > + ExecEndNode(outerPlanState(node)); > > + outerPlanState(node) = NULL; > > > > So the patch wants to enhance the consistency of setting the pointer > > to NULL after freeing part. Robert mentioned his preference for doing > > it in the patch, which I agree with. > > Rebased. There is a leak reported at [1]; details for the same are available at [2]:

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/select_views.out /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/select_views.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/select_views.out 2023-12-19 23:00:04.677385000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/select_views.out 2023-12-19 23:06:26.870259000 +0000
@@ -1288,6 +1288,7 @@
 (102, '2011-10-12', 120),
 (102, '2011-10-28', 200),
 (103, '2011-10-15', 480);
+WARNING: resource was not closed: relation "customer_pkey"
 CREATE VIEW my_property_normal AS SELECT * FROM customer WHERE name = current_user;
 CREATE VIEW my_property_secure WITH (security_barrier) A

[1] - https://cirrus-ci.com/task/6494009196019712
[2] - https://api.cirrus-ci.com/v1/artifact/task/6494009196019712/testrun/build/testrun/regress-running/regress/regression.diffs

Regards, Vignesh
> On 6 Dec 2023, at 23:52, Robert Haas <robertmhaas@gmail.com> wrote: > > I hope that it's at least somewhat useful. > > On 5 Jan 2024, at 15:46, vignesh C <vignesh21@gmail.com> wrote: > > There is a leak reported Hi Amit, this is a kind reminder that some feedback on your patch[0] is waiting for your reply. Thank you for your work! Best regards, Andrey Borodin. [0] https://commitfest.postgresql.org/47/3478/
Hi Andrey, On Sun, Mar 31, 2024 at 2:03 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > On 6 Dec 2023, at 23:52, Robert Haas <robertmhaas@gmail.com> wrote: > > > > I hope that it's at least somewhat useful. > > > On 5 Jan 2024, at 15:46, vignesh C <vignesh21@gmail.com> wrote: > > > > There is a leak reported > > Hi Amit, > > this is a kind reminder that some feedback on your patch[0] is waiting for your reply. > Thank you for your work! Thanks for moving this to the next CF. My apologies (especially to Robert) for not replying on this thread for a long time. I plan to start working on this soon. -- Thanks, Amit Langote
On Fri, 20 Jan 2023 at 08:39, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I spent some time re-reading this whole thread, and the more I read > the less happy I got. We are adding a lot of complexity and introducing > coding hazards that will surely bite somebody someday. And after awhile > I had what felt like an epiphany: the whole problem arises because the > system is wrongly factored. We should get rid of AcquireExecutorLocks > altogether, allowing the plancache to hand back a generic plan that > it's not certain of the validity of, and instead integrate the > responsibility for acquiring locks into executor startup. It'd have > to be optional there, since we don't need new locks in the case of > executing a just-planned plan; but we can easily add another eflags > bit (EXEC_FLAG_GET_LOCKS or so). Then there has to be a convention > whereby the ExecInitNode traversal can return an indicator that > "we failed because the plan is stale, please make a new plan". I also reread the entire thread up to this point yesterday. I've also been thinking about this recently as Amit has mentioned it to me a few times over the past few months. With the caveat of not yet having looked at the latest patch, my thoughts are that having the executor startup responsible for taking locks is a bad idea and I don't think we should go down this path. My reasons are: 1. No ability to control the order that the locks are obtained. The order in which the locks are taken will be at the mercy of the plan the planner chooses. 2. It introduces lots of complexity regarding how to cleanly clean up after a failed executor startup which is likely to make exec startup slower and the code more complex 3. It puts us even further down the path of actually needing an executor startup phase. For #1, the locks taken for SELECT queries are less likely to conflict with other locks obtained by PostgreSQL, but at least at the moment if someone is getting deadlocks with a DDL type operation, they can change their query or DDL script so that locks are taken in the same order. If we allowed executor startup to do this then if someone comes complaining that PG18 deadlocks when PG17 didn't we'd just have to tell them to live with it. There's a comment at the bottom of find_inheritance_children_extended() just above the qsort() which explains about the deadlocking issue. I don't have much extra to say about #2. As mentioned, I've not looked at the patch. On paper, it sounds possible, but it also sounds bug-prone and ugly. For #3, I've been thinking about what improvements we can do to make the executor more efficient. In [1], Andres talks about some very interesting things. In particular, in his email items 3) and 5) are relevant here. If we did move lots of executor startup code into the planner, I think it would be possible to one day get rid of executor startup and have the plan record how much memory is needed for the non-readonly part of the executor state and tag each plan node with the offset in bytes they should use for their portion of the executor working state. This would be a single memory allocation for the entire plan. The exact details are not important here, but I feel like if we load up executor startup with more responsibilities, it'll just make doing something like this harder. The init run-time pruning code that I worked on likely already has done that, but I don't think it's closed the door on it as it might just mean allocating more executor state memory than we need to. 
Providing the plan node records the offset into that memory, I think it could be made to work, just with the inefficiency of having a (possibly) large unused hole in that state memory. As far as I understand it, your objection to the original proposal is just on the grounds of concerns about introducing hazards that could turn into bugs. I think we could come up with some way to make the prior method of doing pruning before executor startup work. I think what Amit had before your objection was starting to turn into something workable and we should switch back to working on that. David [1] https://www.postgresql.org/message-id/20180525033538.6ypfwcqcxce6zkjj%40alap3.anarazel.de
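To make the single-allocation idea a bit more concrete, a deliberately rough sketch follows. None of these structures, fields, or functions exist; this only illustrates the direction being argued about, where the planner records a total size and per-node offsets so executor startup becomes one allocation plus pointer arithmetic.

    /* hypothetical planner outputs */
    typedef struct PlanStateLayout
    {
        Size        total_size;     /* bytes of writable executor state */
    } PlanStateLayout;

    typedef struct PlanNodeLayout
    {
        Size        state_offset;   /* this node's offset into the block */
    } PlanNodeLayout;

    /* one palloc0 for the whole plan, done once at executor startup */
    static char *
    allocate_executor_state(const PlanStateLayout *layout)
    {
        return palloc0(layout->total_size);
    }

    /* each node's init step then just addresses its slice of the block */
    static void *
    node_working_state(char *state_block, const PlanNodeLayout *node_layout)
    {
        return state_block + node_layout->state_offset;
    }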
David Rowley <dgrowleyml@gmail.com> writes: > With the caveat of not yet having looked at the latest patch, my > thoughts are that having the executor startup responsible for taking > locks is a bad idea and I don't think we should go down this path. OK, it's certainly still up for argument, but ... > 1. No ability to control the order that the locks are obtained. The > order in which the locks are taken will be at the mercy of the plan > the planner chooses. I do not think I buy this argument, because plancache.c doesn't provide any "ability to control the order" today, and never has. The order in which AcquireExecutorLocks re-gets relation locks is only weakly related to the order in which the parser/planner got them originally. The order in which AcquirePlannerLocks re-gets the locks is even less related to the original. This doesn't cause any big problems that I'm aware of, because these locks are fairly weak. I think we do have a guarantee that for partitioned tables, parents will be locked before children, and that's probably valuable. But an executor-driven lock order could preserve that property too. > 2. It introduces lots of complexity regarding how to cleanly clean up > after a failed executor startup which is likely to make exec startup > slower and the code more complex Perhaps true, I'm not sure. But the patch we'd been discussing before this proposal was darn complex as well. > 3. It puts us even further down the path of actually needing an > executor startup phase. Huh? We have such a thing already. > For #1, the locks taken for SELECT queries are less likely to conflict > with other locks obtained by PostgreSQL, but at least at the moment if > someone is getting deadlocks with a DDL type operation, they can > change their query or DDL script so that locks are taken in the same > order. If we allowed executor startup to do this then if someone > comes complaining that PG18 deadlocks when PG17 didn't we'd just have > to tell them to live with it. There's a comment at the bottom of > find_inheritance_children_extended() just above the qsort() which > explains about the deadlocking issue. The reason it's important there is that function is (sometimes) used for lock modes that *are* exclusive. > For #3, I've been thinking about what improvements we can do to make > the executor more efficient. In [1], Andres talks about some very > interesting things. In particular, in his email items 3) and 5) are > relevant here. If we did move lots of executor startup code into the > planner, I think it would be possible to one day get rid of executor > startup and have the plan record how much memory is needed for the > non-readonly part of the executor state and tag each plan node with > the offset in bytes they should use for their portion of the executor > working state. I'm fairly skeptical about that idea. The entire reason we have an issue here is that we want to do runtime partition pruning, which by definition can't be done at plan time. So I doubt it's going to play nice with what we are trying to accomplish in this thread. Moreover, while "replace a bunch of small pallocs with one big one" would save some palloc effort, what are you going to do to ensure that that memory has the right initial contents? I think this idea is likely to make the executor a great deal more notationally complex without actually buying all that much. Maybe Andres can make it work, but I don't want to contort other parts of the system design on the purely hypothetical basis that this might happen. 
> I think what Amit had before your objection was starting to turn into > something workable and we should switch back to working on that. The reason I posted this idea was that I didn't think the previously existing patch looked promising at all. regards, tom lane
On Sun, 19 May 2024 at 13:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > David Rowley <dgrowleyml@gmail.com> writes: > > 1. No ability to control the order that the locks are obtained. The > > order in which the locks are taken will be at the mercy of the plan > > the planner chooses. > > I do not think I buy this argument, because plancache.c doesn't > provide any "ability to control the order" today, and never has. > The order in which AcquireExecutorLocks re-gets relation locks is only > weakly related to the order in which the parser/planner got them > originally. The order in which AcquirePlannerLocks re-gets the locks > is even less related to the original. This doesn't cause any big > problems that I'm aware of, because these locks are fairly weak. It may not bite many people, it's just that if it does, I don't see what we could do to help those people. At the moment we could tell them to adjust their DDL script to obtain the locks in the same order as their query. With your idea that cannot be done as the order could change when the planner switches the join order. > I think we do have a guarantee that for partitioned tables, parents > will be locked before children, and that's probably valuable. > But an executor-driven lock order could preserve that property too. I think you'd have to lock the parent before the child. That would remain true and consistent anyway when taking locks during a breadth-first plan traversal. > > For #3, I've been thinking about what improvements we can do to make > > the executor more efficient. In [1], Andres talks about some very > > interesting things. In particular, in his email items 3) and 5) are > > relevant here. If we did move lots of executor startup code into the > > planner, I think it would be possible to one day get rid of executor > > startup and have the plan record how much memory is needed for the > > non-readonly part of the executor state and tag each plan node with > > the offset in bytes they should use for their portion of the executor > > working state. > > I'm fairly skeptical about that idea. The entire reason we have an > issue here is that we want to do runtime partition pruning, which > by definition can't be done at plan time. So I doubt it's going > to play nice with what we are trying to accomplish in this thread. I think we could have both, providing there was a way to still traverse the executor state tree in EXPLAIN. We'd need a way to skip portions of the plan that are not relevant or could be invalid for the current execution. e.g can't show Index Scan because index has been dropped. > > I think what Amit had before your objection was starting to turn into > > something workable and we should switch back to working on that. > > The reason I posted this idea was that I didn't think the previously > existing patch looked promising at all. Ok. It would be good if you could expand on that so we could determine if there's some fundamental reason it can't work or if that's because you were blinded by your epiphany and didn't give that any thought after thinking of the alternative idea. I've gone to effort to point out things that I think are concerning with your idea. It would be good if you could do the same for the previous patch other than "it didn't look promising". It's pretty hard for me to argue with that level of detail. David
On Sun, May 19, 2024 at 9:39 AM David Rowley <dgrowleyml@gmail.com> wrote: > For #1, the locks taken for SELECT queries are less likely to conflict > with other locks obtained by PostgreSQL, but at least at the moment if > someone is getting deadlocks with a DDL type operation, they can > change their query or DDL script so that locks are taken in the same > order. If we allowed executor startup to do this then if someone > comes complaining that PG18 deadlocks when PG17 didn't we'd just have > to tell them to live with it. There's a comment at the bottom of > find_inheritance_children_extended() just above the qsort() which > explains about the deadlocking issue. Thought to chime in on this. A deadlock may occur with the execution-time locking proposed in the patch if the DDL script makes assumptions about how a cached plan's execution determines the locking order for children of multiple parent relations. Specifically, the deadlock can happen if the script tries to lock the child relations directly, instead of locking them through their respective parent relations. The patch doesn't change the order of locking of relations mentioned in the query, because that's defined in AcquirePlannerLocks(). -- Thanks, Amit Langote
I had occasion to run the same benchmark you described in the initial email in this thread. To do so I applied patch series v49 on top of 07cb29737a4e, which is simply a commit that happened to have the same date as v49. I then used a script like this (against a server having plan_cache_mode=force_generic_plan):

for numparts in 0 1 2 4 8 16 32 48 64 80 81 96 127 128 160 200 256 257 288 300 384 512 1024 1536 2048; do
  pgbench testdb -i --partitions=$numparts 2>/dev/null
  echo -ne "$numparts\t"
  pgbench -n testdb -S -T30 -Mprepared | grep "^tps" | sed -e 's/^tps = \([0-9.]*\) .*/\1/'
done

and did the same with the commit mentioned above (that is, unpatched). I got this table as a result:

 partitions │      patched │   07cb29737a
────────────┼──────────────┼──────────────
          0 │ 65632.090431 │ 68967.712741
          1 │ 68096.641831 │ 65356.587223
          2 │ 59456.507575 │ 60884.679464
          4 │    62097.426 │ 59698.747104
          8 │ 58044.311175 │ 57817.104562
         16 │ 59741.926563 │ 52549.916262
         32 │ 59261.693449 │ 44815.317215
         48 │ 59047.125629 │ 38362.123652
         64 │ 59748.738797 │ 34051.158525
         80 │ 59276.839183 │ 32026.135076
         81 │ 62318.572932 │ 30418.122933
         96 │ 59678.857163 │ 28478.113651
        127 │ 58761.960028 │ 24272.303742
        128 │ 59934.268306 │ 24275.214593
        160 │ 56688.790899 │ 21119.043564
        200 │ 56323.188599 │ 18111.212849
        256 │  55915.22466 │ 14753.953709
        257 │ 57810.530461 │ 15093.497575
        288 │ 56874.780092 │ 13873.332162
        300 │ 57222.056549 │ 13463.768946
        384 │  54073.77295 │ 11183.558339
        512 │ 37503.766847 │   8114.32532
       1024 │ 42746.866448 │   4468.41359
       1536 │  39500.58411 │  3049.984599
       2048 │ 36988.519486 │  2269.362006

where already at 16 partitions we can see that things are going downhill with the unpatched code. (However, what happens when the table is not partitioned looks a bit funny.) I hope we can get this new executor code in 18. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "The first law of live demos is: don't try to use the system. Write a script that doesn't touch anything, so as not to cause damage." (Jakob Nielsen)
On Thu, Jun 20, 2024 at 2:09 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I hope we can get this new executor code in 18. Thanks for doing the benchmark, Alvaro, and sorry for the late reply. Yes, I'm hoping to get *some* version of this into v18. I've been thinking how to move this forward and I'm starting to think that we should go back to or at least consider as an option the old approach of changing the plancache to do the initial runtime pruning instead of changing the executor to take locks, which is the design that the latest patch set tries to implement. Here are the challenges facing the implementation of the current design: 1. I went through many iterations of the changes to ExecInitNode() to return a partially initialized PlanState tree when it detects that the CachedPlan was invalidated after locking a child table and to ExecEndNode() to account for the PlanState tree sometimes being partially initialized, but it still seems fragile and bug-prone to me. It might be because this approach is fundamentally hard to get right or I haven't invested enough effort in becoming more confident in its robustness. 2. Refactoring needed due to the ExecutorStart() API change especially that pertaining to portals does not seem airtight. I'm especially worried about moving the ExecutorStart() call for the PORTAL_MULTI_QUERY case from where it is currently to PortalStart(). That requires additional bookkeeping in PortalData and I am not totally sure that the snapshot handling changes after that move are entirely correct. 3. The need to add *back* the fields to store the RT indexes of relations that are not looked at by ExecInitNode() traversal such as root partitioned tables and non-leaf partitions. I'm worried about #2 the most. One complaint about the previous design was that the interface changes to capture and pass the result of doing initial pruning in plancache.c to the executor did not look great. However, after having tried doing #2, the changes to pass the pruning result into the executor and changes to reuse it in ExecInit[Merge]Append() seem a tad bit simpler than the refactoring and adjustments needed to handle failed ExecutorStart() calls, at multiple code sites. About #1, I tend to agree with David that adding complexity around PlanState tree construction may not be a good idea, because we might want to rethink Plan initialization code and data structures in the not too distant future. One idea I thought of is to take the remaining locks (to wit, those on inheritance children if running a cached plan) at the beginning of InitPlan(), that is before ExecInitNode(), like we handle the permission checking, so that we don't need to worry about ever returning a partially initialized PlanState tree. However, we're still left with the tall task to implement #2 such that it doesn't break anything. Another concern about the old design was the unnecessary overhead of initializing bitmapset fields in PlannedStmt that are meant for the locking algorithm in AcquireExecutorLocks(). Andres suggested an idea offlist to either piggyback on cursorOptions argument of pg_plan_queries() or adding a new boolean parameter to let the planner know if the plan is one that might get cached and thus have AcquireExecutorLocks() called on it. Another idea David and I discussed offlist is inventing a RTELockInfo (cf RTEPermissionInfo) and only creating one for each RT entry that is un-prunable and do away with PlannedStmt.rtable. 
For partitioned tables, that entry will point to the PartitionPruneInfo that will contain the RT indexes of partitions (or maybe just OIDs) mapped from their subplan indexes that are returned by the pruning code. So AcquireExecutorLocks() will lock all un-prunable relations by referring to their RTELockInfo entries and for each entry that points to a PartitionPruneInfo with initial pruning steps, will only lock the partitions that survive the pruning. I am planning to polish that old patch set and post after playing with those new ideas. -- Thanks, Amit Langote
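A pseudo-C sketch of the locking scheme described in the last two paragraphs, to make it easier to picture. RTELockInfo and every field used on it are hypothetical, and PerformInitialPruning() is a stand-in for running only the "initial" pruning steps; LockRelationOid() and the Bitmapset API are the real ones.

    static void
    AcquireExecutorLocksPruned(PlannedStmt *stmt)
    {
        ListCell   *lc;

        foreach(lc, stmt->rteLockInfos)     /* hypothetical replacement for
                                             * walking the full rtable */
        {
            RTELockInfo *lockinfo = (RTELockInfo *) lfirst(lc);

            /* un-prunable relations are always locked */
            LockRelationOid(lockinfo->relid, lockinfo->lockmode);

            if (lockinfo->pruneinfo != NULL)
            {
                /* run only the initial pruning steps, then lock survivors */
                Bitmapset  *surviving = PerformInitialPruning(lockinfo->pruneinfo);
                int         i = -1;

                while ((i = bms_next_member(surviving, i)) >= 0)
                    LockRelationOid(lockinfo->part_oids[i], lockinfo->lockmode);
            }
        }
    }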
On Mon, Aug 12, 2024 at 8:54 AM Amit Langote <amitlangote09@gmail.com> wrote: > 1. I went through many iterations of the changes to ExecInitNode() to > return a partially initialized PlanState tree when it detects that the > CachedPlan was invalidated after locking a child table and to > ExecEndNode() to account for the PlanState tree sometimes being > partially initialized, but it still seems fragile and bug-prone to me. > It might be because this approach is fundamentally hard to get right > or I haven't invested enough effort in becoming more confident in its > robustness. Can you give some examples of what's going wrong, or what you think might go wrong? I didn't think there was a huge problem here based on previous discussion, but I could very well be missing some important challenge. > 2. Refactoring needed due to the ExecutorStart() API change especially > that pertaining to portals does not seem airtight. I'm especially > worried about moving the ExecutorStart() call for the > PORTAL_MULTI_QUERY case from where it is currently to PortalStart(). > That requires additional bookkeeping in PortalData and I am not > totally sure that the snapshot handling changes after that move are > entirely correct. Here again, it would help to see exactly what you had to do and what consequences you think it might have. But it sounds like you're talking about moving ExecutorStart() from PortalStart() to PortalRun() and I agree that sounds like it might have user-visible behavioral consequences that we don't want. > 3. The need to add *back* the fields to store the RT indexes of > relations that are not looked at by ExecInitNode() traversal such as > root partitioned tables and non-leaf partitions. I don't remember exactly why we removed those or what the benefit was, so I'm not sure how big of a problem it is if we have to put them back. > About #1, I tend to agree with David that adding complexity around > PlanState tree construction may not be a good idea, because we might > want to rethink Plan initialization code and data structures in the > not too distant future. Like Tom, I don't really buy this. There might be a good reason not to do this in ExecutorStart(), but the hypothetical possibility that we might want to change something and that this patch might make it harder is not it. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Aug 15, 2024 at 8:57 AM Amit Langote <amitlangote09@gmail.com> wrote: > TBH, it's more of a hunch that people who are not involved in this > development might find the new reality, whereby the execution is not > racefree until ExecutorRun(), hard to reason about. I'm confused by what you mean here by "racefree". A race means multiple sessions are doing stuff at the same time and the result depends on who does what first, but the executor stuff is all backend-private. Heavyweight locks are not backend-private, but those would be taken in ExectorStart(), not ExecutorRun(), IIUC. > With the patch, CreateQueryDesc() and ExecutorStart() are moved to > PortalStart() so that QueryDescs including the PlanState trees for all > queries are built before any is run. Why? So that if ExecutorStart() > fails for any query in the list, we can simply throw out the QueryDesc > and the PlanState trees of the previous queries (NOT run them) and ask > plancache for a new CachedPlan for the list of queries. We don't have > a way to ask plancache.c to replan only a given query in the list. I agree that moving this from PortalRun() to PortalStart() seems like a bad idea, especially in view of what you write below. > * There's no longer CCI() between queries in PortalRunMulti() because > the snapshots in each query's QueryDesc must have been adjusted to > reflect the correct command counter. I've checked but can't really be > sure if the value in the snapshot is all anyone ever uses if they want > to know the current value of the command counter. I don't think anything stops somebody wanting to look at the current value of the command counter. I also don't think you can remove the CommandCounterIncrement() calls between successive queries, because then they won't see the effects of earlier calls. So this sounds broken to me. Also keep in mind that one of the queries could call a function which does something that bumps the command counter again. I'm not sure if that creates its own hazzard separate from the lack of CCIs, or whether it's just another part of that same issue. But you can't assume that each query's snapshot should have a command counter value one more than the previous query. While this all seems bad for the partially-initialized-execution-tree approach, I wonder if you don't have problems here with the other design, too. Let's say you've the multi-query case and there are 2 queries. The first one (Q1) is SELECT mysterious_function() and the second one (Q2) is SELECT * FROM range_partitioned_table WHERE key_column = 42. What if mysterious_function() performs DDL on range_partitioned_table? I haven't tested this so maybe there are things going on here that prevent trouble, but it seems like executing Q1 can easily invalidate the plan for Q2. And then it seems like you're basically back to the same problem. > > > 3. The need to add *back* the fields to store the RT indexes of > > > relations that are not looked at by ExecInitNode() traversal such as > > > root partitioned tables and non-leaf partitions. > > > > I don't remember exactly why we removed those or what the benefit was, > > so I'm not sure how big of a problem it is if we have to put them > > back. > > We removed those in commit 52ed730d511b after commit f2343653f5b2 > removed redundant execution-time locking of non-leaf relations. 
> So we > removed them because we realized that execution time locking is > unnecessary given that AcquireExecutorLocks() exists and now we want > to add them back because we'd like to get rid of > AcquireExecutorLocks(). :-) My bias is to believe that getting rid of AcquireExecutorLocks() is probably the right thing to do, but that's not a strongly-held position and I could be totally wrong about it. The thing is, though, that AcquireExecutorLocks() is fundamentally stupid, and it's hard to see how it can ever be any smarter. If we want to make smarter decisions about what to lock, it seems reasonable to me to think that the locking code needs to be closer to code that can evaluate expressions and prune partitions and stuff like that. -- Robert Haas EDB: http://www.enterprisedb.com
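A minimal SQL sketch of the CCI point above: in a multi-query portal produced by an ON INSERT rule, the rule's query can only see the row added by the original query if the command counter is advanced between the two. All names here are invented for illustration.

create table src (a int);
create table audit (a int);
create rule src_audit as on insert to src
    do also insert into audit select a from src where a = new.a;

insert into src values (1);
select count(*) from audit;   -- expect 1, because the CCI between the original
                              -- insert and the rule action makes the new src
                              -- row visible to the action's SELECT

Removing the CCI between the two queries would presumably leave the rule action reading src with the original command counter, so it would not see the just-inserted row.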
On Fri, Aug 16, 2024 at 12:35 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Aug 15, 2024 at 8:57 AM Amit Langote <amitlangote09@gmail.com> wrote: > > TBH, it's more of a hunch that people who are not involved in this > > development might find the new reality, whereby the execution is not > > racefree until ExecutorRun(), hard to reason about. > > I'm confused by what you mean here by "racefree". A race means > multiple sessions are doing stuff at the same time and the result > depends on who does what first, but the executor stuff is all > backend-private. Heavyweight locks are not backend-private, but those > would be taken in ExectorStart(), not ExecutorRun(), IIUC. Sorry, yes, I meant ExecutorStart(). A backend that wants to execute a plan tree from a CachedPlan is in a race with other backends that might modify tables before ExecutorStart() takes the remaining locks. That race window is bigger when it is ExecutorStart() that will take the locks, and I don't mean in terms of timing, but in terms of the other code that can run in between GetCachedPlan() returning a partially valid plan and ExecutorStart() taking the remaining locks, depending on the calling module. > > With the patch, CreateQueryDesc() and ExecutorStart() are moved to > > PortalStart() so that QueryDescs including the PlanState trees for all > > queries are built before any is run. Why? So that if ExecutorStart() > > fails for any query in the list, we can simply throw out the QueryDesc > > and the PlanState trees of the previous queries (NOT run them) and ask > > plancache for a new CachedPlan for the list of queries. We don't have > > a way to ask plancache.c to replan only a given query in the list. > > I agree that moving this from PortalRun() to PortalStart() seems like > a bad idea, especially in view of what you write below. > > > * There's no longer CCI() between queries in PortalRunMulti() because > > the snapshots in each query's QueryDesc must have been adjusted to > > reflect the correct command counter. I've checked but can't really be > > sure if the value in the snapshot is all anyone ever uses if they want > > to know the current value of the command counter. > > I don't think anything stops somebody wanting to look at the current > value of the command counter. I also don't think you can remove the > CommandCounterIncrement() calls between successive queries, because > then they won't see the effects of earlier calls. So this sounds > broken to me. I suppose you mean CCI between "running" (calling ExecutorRun on) successive queries. Then the patch is indeed broken. If we're to make that right, the number of CCIs for the multi-query portals will have to double given the separation of ExecutorStart() and ExecutorRun() phases. > Also keep in mind that one of the queries could call a function which > does something that bumps the command counter again. I'm not sure if > that creates its own hazzard separate from the lack of CCIs, or > whether it's just another part of that same issue. But you can't > assume that each query's snapshot should have a command counter value > one more than the previous query. > > While this all seems bad for the partially-initialized-execution-tree > approach, I wonder if you don't have problems here with the other > design, too. Let's say you've the multi-query case and there are 2 > queries. The first one (Q1) is SELECT mysterious_function() and the > second one (Q2) is SELECT * FROM range_partitioned_table WHERE > key_column = 42. 
What if mysterious_function() performs DDL on > range_partitioned_table? I haven't tested this so maybe there are > things going on here that prevent trouble, but it seems like executing > Q1 can easily invalidate the plan for Q2. And then it seems like > you're basically back to the same problem. A rule (but not views AFAICS) can lead to the multi-query case (there might be other ways). I tried the following, and, yes, the plan for the query queued by the rule is broken by the execution of that for the 1st query: create table foo (a int); create table bar (a int); create or replace function foo_trig_func () returns trigger as $$ begin drop table bar cascade; return new.*; end; $$ language plpgsql; create trigger foo_trig before insert on foo execute function foo_trig_func(); create rule insert_foo AS ON insert TO foo do also insert into bar values (new.*); set plan_cache_mode to force_generic_plan ; prepare q as insert into foo values (1); execute q; NOTICE: drop cascades to rule insert_foo on table foo ERROR: relation with OID 16418 does not exist The ERROR comes from trying to run (actually "initialize") the cached plan for `insert into bar values (new.*);` which is due to the rule. Though, it doesn't have to be a cached plan for the breakage to happen. You can see the same error without the prepared statement: insert into foo values (1); NOTICE: drop cascades to rule insert_foo on table foo ERROR: relation with OID 16418 does not exist Another example: create or replace function foo_trig_func () returns trigger as $$ begin alter table bar add b int; return new.*; end; $$ language plpgsql; execute q; ERROR: table row type and query-specified row type do not match DETAIL: Query has too few columns. insert into foo values (1); ERROR: table row type and query-specified row type do not match DETAIL: Query has too few columns. This time the error occurs in ExecModifyTable(), so when "running" the plan, but again the code that's throwing the error is just "lazy" initialization of the ProjectionInfo when inserting into bar. So it is possible for the executor to try to run a plan that has become invalid since it was created, so... > > > > 3. The need to add *back* the fields to store the RT indexes of > > > > relations that are not looked at by ExecInitNode() traversal such as > > > > root partitioned tables and non-leaf partitions. > > > > > > I don't remember exactly why we removed those or what the benefit was, > > > so I'm not sure how big of a problem it is if we have to put them > > > back. > > > > We removed those in commit 52ed730d511b after commit f2343653f5b2 > > removed redundant execution-time locking of non-leaf relations. So we > > removed them because we realized that execution time locking is > > unnecessary given that AcquireExecutorLocks() exists and now we want > > to add them back because we'd like to get rid of > > AcquireExecutorLocks(). :-) > > My bias is to believe that getting rid of AcquireExecutorLocks() is > probably the right thing to do, but that's not a strongly-held > position and I could be totally wrong about it. The thing is, though, > that AcquireExecutorLocks() is fundamentally stupid, and it's hard to > see how it can ever be any smarter. If we want to make smarter > decisions about what to lock, it seems reasonable to me to think that > the locking code needs to be closer to code that can evaluate > expressions and prune partitions and stuff like that. 
One perhaps crazy idea [1]: What if we remove AcquireExecutorLocks() and move the responsibility of taking the remaining necessary locks into the executor (those on any inheritance children that are added during planning and thus not accounted for by AcquirePlannerLocks()), like the patch already does, but don't make it also check if the plan has become invalid, which it can't do anyway unless it's from a CachedPlan. That means we instead let the executor throw any errors that occur when trying to either initialize the plan because of the changes that have occurred to the objects referenced in the plan, like what is happening in the above example. If that case is going to be rare anyway, why spend energy on checking the validity and replan, especially if that's not an easy thing to do as we're finding out. In the above example, we could say that it's a user error to create a rule like that, so it should not happen in practice, but when it does, the executor seems to deal with it correctly by refusing to execute a broken plan. Perhaps it's more worthwhile to make the executor behave correctly in the face of plan invalidation than teach the rest of the system to deal with the executor throwing its hands up when it runs into an invalid plan? Again, I think this may be a crazy line of thinking but just wanted to get it out there. -- Thanks, Amit Langote [1] I recall Michael Paquier mentioning something like this to me once when I was describing this patch and thread to him.
On Fri, Aug 16, 2024 at 8:36 AM Amit Langote <amitlangote09@gmail.com> wrote: > So it is possible for the executor to try to run a plan that has > become invalid since it was created, so... I'm not sure what the "so what" here is. > One perhaps crazy idea [1]: > > What if we remove AcquireExecutorLocks() and move the responsibility > of taking the remaining necessary locks into the executor (those on > any inheritance children that are added during planning and thus not > accounted for by AcquirePlannerLocks()), like the patch already does, > but don't make it also check if the plan has become invalid, which it > can't do anyway unless it's from a CachedPlan. That means we instead > let the executor throw any errors that occur when trying to either > initialize the plan because of the changes that have occurred to the > objects referenced in the plan, like what is happening in the above > example. If that case is going to be rare anway, why spend energy on > checking the validity and replan, especially if that's not an easy > thing to do as we're finding out. In the above example, we could say > that it's a user error to create a rule like that, so it should not > happen in practice, but when it does, the executor seems to deal with > it correctly by refusing to execute a broken plan . Perhaps it's more > worthwhile to make the executor behave correctly in face of plan > invalidation than teach the rest of the system to deal with the > executor throwing its hands up when it runs into an invalid plan? > Again, I think this may be a crazy line of thinking but just wanted to > get it out there. I don't know whether this is crazy or not. I think there are two issues. One, the set of checks that we have right now might not be complete, and we might just not have realized that because it happens infrequently enough that we haven't found all the bugs. If that's so, then a change like this could be a good thing, because it might force us to fix stuff we should be fixing anyway. I have a feeling that some of the checks you hit there were added as bug fixes long after the code was written originally, so my confidence that we don't have more bugs isn't especially high. And two, it matters a lot how frequent the errors will be in practice. I think we normally try to replan rather than let a stale plan be used because we want to not fail, because users don't like failure. If the design you propose here would make failures more (or less) frequent, then that's a problem (or awesome). -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Aug 16, 2024 at 8:36 AM Amit Langote <amitlangote09@gmail.com> wrote: >> So it is possible for the executor to try to run a plan that has >> become invalid since it was created, so... > I'm not sure what the "so what" here is. The fact that there are holes in our protections against that doesn't make it a good idea to walk away from the protections. That path leads to crashes and data corruption and unhappy users. What the examples here are showing is that AcquireExecutorLocks is incomplete because it only provides defenses against DDL initiated by other sessions, not by our own session. We have CheckTableNotInUse but I'm not sure if it could be applied here. We certainly aren't calling that in anywhere near as systematic a way as we have for acquiring locks. Maybe we should rethink the principle that a session's locks never conflict against itself, although I fear that might be a nasty can of worms. Could it work to do CheckTableNotInUse when acquiring an exclusive table lock? I don't doubt that we'd have to fix some code paths, but if the damage isn't extensive then that might offer a more nearly bulletproof approach. regards, tom lane
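For reference, a minimal sketch of the intra-session protection that CheckTableNotInUse currently provides (names invented); the earlier foo/bar example presumably slips past it because bar is not yet open when the trigger drops it:

create table t (a int);
begin;
declare c cursor for select * from t;   -- keeps t open in this session
alter table t add column b int;
-- expected to fail with something like:
-- ERROR:  cannot ALTER TABLE "t" because it is being used by active queries in this session
-- (a DROP TABLE t here is refused with a similar message)
rollback;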
On Mon, Aug 19, 2024 at 12:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > What the examples here are showing is that AcquireExecutorLocks > is incomplete because it only provides defenses against DDL > initiated by other sessions, not by our own session. We have > CheckTableNotInUse but I'm not sure if it could be applied here. > We certainly aren't calling that in anywhere near as systematic > a way as we have for acquiring locks. > > Maybe we should rethink the principle that a session's locks > never conflict against itself, although I fear that might be > a nasty can of worms. It might not be that bad. It could replace the CheckTableNotInUse() protections that we have today but maybe cover more cases, and it could do so without needing any changes to the shared lock manager. Say every time you start a query you give that query an ID number, and all locks taken by that query are tagged with that ID number in the local lock table, and maybe some flags indicating why the lock was taken. When a new lock acquisition comes along you can say "oh, this lock was previously taken so that we could do thus-and-so" and then use that to fail with the appropriate error message. That seems like it might be more powerful than the refcnt check within CheckTableNotInUse(). But that seems somewhat incidental to what this thread is about. IIUC, Amit's original design involved having the plan cache call some new executor function to do partition pruning before lock acquisition, and then passing that data structure around, including back to the executor, so that we didn't repeat the pruning we already did, which would be a bad thing to do not only because it would incur CPU cost but also because really bad things would happen if we got a different answer the second time. IIUC, you didn't think that was going to work out nicely, and suggested instead moving the pruning+locking to ExecutorStart() time. But now Amit is finding problems with that approach, because by the time we reach PortalRun() for the PORTAL_MULTI_QUERY case, it's too late to replan, because we can't ask the plancache to replan just one query from the list; and if we try to fix that by moving ExecutorStart() to PortalStart(), then there are other problems. Do you have a view on what the way forward might be? This thread has gotten a tad depressing, honestly. All of the opinions about what we ought to do seem to be based on the firm conviction that X or Y or Z will not work, rather than on the confidence that A or B or C will work. Yet I'm inclined to believe this problem is solvable. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > But that seems somewhat incidental to what this thread is about. Perhaps. But if we're running into issues related to that, it might be good to set aside the long-term goal for a bit and come up with a cleaner answer for intra-session locking. That could allow the pruning problem to be solved more cleanly in turn, and it'd be an improvement even if not. > Do you have a view on what the way forward might be? I'm fresh out of ideas at the moment, other than having a hope that divide-and-conquer (ie, solving subproblems first) might pay off. > This thread has gotten a tad depressing, honestly. All of the opinions > about what we ought to do seem to be based on the firm conviction that > X or Y or Z will not work, rather than on the confidence that A or B > or C will work. Yet I'm inclined to believe this problem is solvable. Yeah. We are working in an extremely not-green field here, which means it's a lot easier to see pre-existing reasons why X will not work than to have confidence that it will work. But hey, if this were easy then we'd have done it already. regards, tom lane
On Mon, Aug 19, 2024 at 1:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > But that seems somewhat incidental to what this thread is about. > > Perhaps. But if we're running into issues related to that, it might > be good to set aside the long-term goal for a bit and come up with > a cleaner answer for intra-session locking. That could allow the > pruning problem to be solved more cleanly in turn, and it'd be > an improvement even if not. Maybe, but the pieces aren't quite coming together for me. Solving this would mean that if we execute a stale plan, we'd be more likely to get a good error and less likely to get a bad, nasty-looking internal error, or a crash. That's good on its own terms, but we don't really want user queries to produce errors at all, so I don't think we'd feel any more free to rearrange the order of operations than we do today. > > Do you have a view on what the way forward might be? > > I'm fresh out of ideas at the moment, other than having a hope that > divide-and-conquer (ie, solving subproblems first) might pay off. Fair enough, but why do you think that the original approach of creating a data structure from within the plan cache mechanism (probably via a call into some new executor entrypoint) and then feeding that through to ExecutorRun() time can't work? Is it possible you latched onto some non-optimal decisions that the early versions of the patch made, rather than there being a fundamental problem with the concept? I actually thought the do-it-at-executorstart-time approach sounded pretty good, even though we might have to abandon planstate tree initialization partway through, right up until Amit started talking about moving ExecutorStart() from PortalRun() to PortalStart(), which I have a feeling is going to create a bigger problem than we can solve. I think if we want to save that approach, we should try to figure out if we can teach the plancache to replan one query from a list without replanning the others, which seems like it might allow us to keep the order of major operations unchanged. Otherwise, it makes sense to me to have another go at the other approach, at least to make sure we understand clearly why it can't work. > Yeah. We are working in an extremely not-green field here, which > means it's a lot easier to see pre-existing reasons why X will not > work than to have confidence that it will work. But hey, if this > were easy then we'd have done it already. Yeah, true. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Aug 20, 2024 at 1:39 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Aug 16, 2024 at 8:36 AM Amit Langote <amitlangote09@gmail.com> wrote: > > So it is possible for the executor to try to run a plan that has > > become invalid since it was created, so... > > I'm not sure what the "so what" here is. I meant that if the executor has to deal with broken plans anyway, we might as well lean into that fact by choosing not to handle only the cached plan case in a certain way. Yes, I understand that that's not a good justification. > > One perhaps crazy idea [1]: > > > > What if we remove AcquireExecutorLocks() and move the responsibility > > of taking the remaining necessary locks into the executor (those on > > any inheritance children that are added during planning and thus not > > accounted for by AcquirePlannerLocks()), like the patch already does, > > but don't make it also check if the plan has become invalid, which it > > can't do anyway unless it's from a CachedPlan. That means we instead > > let the executor throw any errors that occur when trying to either > > initialize the plan because of the changes that have occurred to the > > objects referenced in the plan, like what is happening in the above > > example. If that case is going to be rare anway, why spend energy on > > checking the validity and replan, especially if that's not an easy > > thing to do as we're finding out. In the above example, we could say > > that it's a user error to create a rule like that, so it should not > > happen in practice, but when it does, the executor seems to deal with > > it correctly by refusing to execute a broken plan . Perhaps it's more > > worthwhile to make the executor behave correctly in face of plan > > invalidation than teach the rest of the system to deal with the > > executor throwing its hands up when it runs into an invalid plan? > > Again, I think this may be a crazy line of thinking but just wanted to > > get it out there. > > I don't know whether this is crazy or not. I think there are two > issues. One, the set of checks that we have right now might not be > complete, and we might just not have realized that because it happens > infrequently enough that we haven't found all the bugs. If that's so, > then a change like this could be a good thing, because it might force > us to fix stuff we should be fixing anyway. I have a feeling that some > of the checks you hit there were added as bug fixes long after the > code was written originally, so my confidence that we don't have more > bugs isn't especially high. This makes sense. > And two, it matters a lot how frequent the errors will be in practice. > I think we normally try to replan rather than let a stale plan be used > because we want to not fail, because users don't like failure. If the > design you propose here would make failures more (or less) frequent, > then that's a problem (or awesome). I think we'd modify plancache.c to postpone the locking of only prunable relations (i.e., partitions), so we're looking at only a handful of concurrent modifications that are going to cause execution errors. That's because we disallow many DDL modifications of partitions unless they are done via recursion from the parent, so the space of errors in practice would be smaller compared to if we were to postpone *all* cached plan locks to ExecInitNode() time. DROP INDEX a_partion_only_index comes to mind as something that might cause an error. I've not tested if other partition-only constraints can cause unsafe behaviors. 
Perhaps we can add the check for CachedPlan.is_valid after every table_open() and index_open() in the executor that takes a lock or at all the places we discussed previously and throw the error (say: "cached plan is no longer valid") if it's false. That's better than running into and throwing some random error by soldiering ahead with its initialization / execution, but still a loss in terms of user experience because we're adding a new failure mode, however rare. -- Thanks, Amit Langote
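A minimal sketch of the partition-only index case mentioned above (names invented):

create table pt (a int, b text) partition by range (a);
create table pt1 partition of pt for values from (1) to (100);
create table pt2 partition of pt for values from (100) to (200);
create index pt1_b_idx on pt1 (b);    -- exists only on pt1, not on the parent

-- A cached generic plan that index-scans pt1 via pt1_b_idx could start failing
-- if a concurrent session runs: drop index pt1_b_idx;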
On Tue, Aug 20, 2024 at 3:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Aug 19, 2024 at 1:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > > But that seems somewhat incidental to what this thread is about. > > > > Perhaps. But if we're running into issues related to that, it might > > be good to set aside the long-term goal for a bit and come up with > > a cleaner answer for intra-session locking. That could allow the > > pruning problem to be solved more cleanly in turn, and it'd be > > an improvement even if not. > > Maybe, but the pieces aren't quite coming together for me. Solving > this would mean that if we execute a stale plan, we'd be more likely > to get a good error and less likely to get a bad, nasty-looking > internal error, or a crash. That's good on its own terms, but we don't > really want user queries to produce errors at all, so I don't think > we'd feel any more free to rearrange the order of operations than we > do today. Yeah, it's unclear whether executing a potentially stale plan is an acceptable tradeoff compared to replanning, especially if it occurs rarely. Personally, I would prefer that it is. > > > Do you have a view on what the way forward might be? > > > > I'm fresh out of ideas at the moment, other than having a hope that > > divide-and-conquer (ie, solving subproblems first) might pay off. > > Fair enough, but why do you think that the original approach of > creating a data structure from within the plan cache mechanism > (probably via a call into some new executor entrypoint) and then > feeding that through to ExecutorRun() time can't work? That would be ExecutorStart(). The data structure need not be referenced after ExecInitNode(). > Is it possible > you latched onto some non-optimal decisions that the early versions of > the patch made, rather than there being a fundamental problem with the > concept? > > I actually thought the do-it-at-executorstart-time approach sounded > pretty good, even though we might have to abandon planstate tree > initialization partway through, right up until Amit started talking > about moving ExecutorStart() from PortalRun() to PortalStart(), which > I have a feeling is going to create a bigger problem than we can > solve. I think if we want to save that approach, we should try to > figure out if we can teach the plancache to replan one query from a > list without replanning the others, which seems like it might allow us > to keep the order of major operations unchanged. Otherwise, it makes > sense to me to have another go at the other approach, at least to make > sure we understand clearly why it can't work. +1 -- Thanks, Amit Langote
On Tue, Aug 20, 2024 at 9:00 AM Amit Langote <amitlangote09@gmail.com> wrote: > I think we'd modify plancache.c to postpone the locking of only > prunable relations (i.e., partitions), so we're looking at only a > handful of concurrent modifications that are going to cause execution > errors. That's because we disallow many DDL modifications of > partitions unless they are done via recursion from the parent, so the > space of errors in practice would be smaller compared to if we were to > postpone *all* cached plan locks to ExecInitNode() time. DROP INDEX > a_partion_only_index comes to mind as something that might cause an > error. I've not tested if other partition-only constraints can cause > unsafe behaviors. This seems like a valid point to some extent, but in other contexts we've had discussions about how we don't actually guarantee all that much uniformity between a partitioned table and its partitions, and it's been questioned whether we made the right decisions there. So I'm not entirely sure that the surface area for problems here will be as narrow as you're hoping -- I think we'd need to go through all of the ALTER TABLE variants and think it through. But maybe the problems aren't that bad. It does seem like constraints can change the plan. Imagine the partition had a CHECK(false) constraint before and now doesn't, or something. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Aug 20, 2024 at 11:53 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Aug 20, 2024 at 9:00 AM Amit Langote <amitlangote09@gmail.com> wrote: > > I think we'd modify plancache.c to postpone the locking of only > > prunable relations (i.e., partitions), so we're looking at only a > > handful of concurrent modifications that are going to cause execution > > errors. That's because we disallow many DDL modifications of > > partitions unless they are done via recursion from the parent, so the > > space of errors in practice would be smaller compared to if we were to > > postpone *all* cached plan locks to ExecInitNode() time. DROP INDEX > > a_partion_only_index comes to mind as something that might cause an > > error. I've not tested if other partition-only constraints can cause > > unsafe behaviors. > > This seems like a valid point to some extent, but in other contexts > we've had discussions about how we don't actually guarantee all that > much uniformity between a partitioned table and its partitions, and > it's been questioned whether we made the right decisions there. So I'm > not entirely sure that the surface area for problems here will be as > narrow as you're hoping -- I think we'd need to go through all of the > ALTER TABLE variants and think it through. But maybe the problems > aren't that bad. Many changeable properties that are reflected in the RelationData of a partition after getting the lock on it seem to cause no issues as long as the executor code only looks at RelationData, which is true for most Scan nodes. It also seems true for ModifyTable which looks into RelationData for relation properties relevant to inserts/deletes. The two things that don't cope are: * Index Scan nodes with concurrent DROP INDEX of partition-only indexes. * Concurrent DROP CONSTRAINT of partition-only CHECK and NOT NULL constraints can lead to incorrect results, as I write below. > It does seem like constraints can change the plan. Imagine the > partition had a CHECK(false) constraint before and now doesn't, or > something. Yeah, if the CHECK constraint gets dropped concurrently, any new rows that got added after that will not be returned by executing a stale cached plan, because the plan would have been created based on the assumption that such rows shouldn't be there due to the CHECK constraint. We currently don't explicitly check that the constraints that were used during planning still exist before executing the plan. Overall, I'm starting to feel less enthused by the idea of throwing an error in the executor due to known and unknown hazards of trying to execute a stale plan. Even if we made a note in the docs of such hazards, any users who run into these rare errors are likely to head to -bugs or -hackers anyway. Tom said we should perhaps look at the hazards caused by intra-session locking, but we'd still be left with the hazards of missing indexes and constraints, AFAICS, due to DROP from other sessions. So, the options: * The replanning aspect of the lock-in-the-executor design would be simpler if a CachedPlan contained the plan for a single query rather than a list of queries, as previously mentioned. This is particularly due to the requirements of the PORTAL_MULTI_QUERY case. However, this option might be impractical. * Polish the patch for the old design of doing the initial pruning before AcquireExecutorLocks() and focus on hashing out any bugs and issues of that design. -- Thanks, Amit Langote
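A sketch of the constraint hazard described above, assuming constraint exclusion applies to the partition's ordinary CHECK constraint (names invented):

create table pt (a int, b int) partition by range (a);
create table pt1 partition of pt for values from (1) to (100);
create table pt2 partition of pt for values from (100) to (200);
alter table pt2 add constraint pt2_b_check check (b > 100);

-- A plan for "select * from pt where b = 42" built while pt2_b_check exists may
-- exclude pt2 entirely; if another session then drops the constraint and inserts
-- (150, 42), executing the stale cached plan silently misses that row.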
On Wed, Aug 21, 2024 at 8:45 AM Amit Langote <amitlangote09@gmail.com> wrote: > * The replanning aspect of the lock-in-the-executor design would be > simpler if a CachedPlan contained the plan for a single query rather > than a list of queries, as previously mentioned. This is particularly > due to the requirements of the PORTAL_MULTI_QUERY case. However, this > option might be impractical. It might be, but maybe it would be worth a try? I mean, GetCachedPlan() seems to just call pg_plan_queries() which just loops over the list of query trees and does the same thing for each one. If we wanted to replan a single query, why couldn't we do fake_querytree_list = list_make1(list_nth(querytree_list, n)) and then call pg_plan_queries(fake_querytree_list)? Or something equivalent to that. We could have a new GetCachedSinglePlan(cplan, n) to do this. > * Polish the patch for the old design of doing the initial pruning > before AcquireExecutorLocks() and focus on hashing out any bugs and > issues of that design. That's also an option. It probably has issues too, but I don't know what they are exactly. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Aug 21, 2024 at 10:10 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Aug 21, 2024 at 8:45 AM Amit Langote <amitlangote09@gmail.com> wrote: > > * The replanning aspect of the lock-in-the-executor design would be > > simpler if a CachedPlan contained the plan for a single query rather > > than a list of queries, as previously mentioned. This is particularly > > due to the requirements of the PORTAL_MULTI_QUERY case. However, this > > option might be impractical. > > It might be, but maybe it would be worth a try? I mean, > GetCachedPlan() seems to just call pg_plan_queries() which just loops > over the list of query trees and does the same thing for each one. If > we wanted to replan a single query, why couldn't we do > fake_querytree_list = list_make1(list_nth(querytree_list, n)) and then > call pg_plan_queries(fake_querytree_list)? Or something equivalent to > that. We could have a new GetCachedSinglePlan(cplan, n) to do this. I've been hacking to prototype this, and it's showing promise. It helps make the replan loop at the call sites that start the executor with an invalidatable plan more localized and less prone to action-at-a-distance issues. However, the interface and contract of the new function in my prototype are pretty specialized for the replan loop in this context—meaning it's not as general-purpose as GetCachedPlan(). Essentially, what you get when you call it is a 'throwaway' CachedPlan containing only the plan for the query that failed during ExecutorStart(), not a plan integrated into the original CachedPlanSource's stmt_list. A call site entering the replan loop will retry the execution with that throwaway plan, release it once done, and resume looping over the plans in the original list. The invalid plan that remains in the original list will be discarded and replanned in the next call to GetCachedPlan() using the same CachedPlanSource. While that may sound undesirable, I'm inclined to think it's not something that needs optimization, given that we're expecting this code path to be taken rarely. I'll post a version of a revamped locks-in-the-executor patch set using the above function after debugging some more. -- Thanks, Amit Langote
Hi, On Thu, Aug 29, 2024 at 9:34 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Fri, Aug 23, 2024 at 9:48 PM Amit Langote <amitlangote09@gmail.com> wrote: > > On Wed, Aug 21, 2024 at 10:10 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > On Wed, Aug 21, 2024 at 8:45 AM Amit Langote <amitlangote09@gmail.com> wrote: > > > > * The replanning aspect of the lock-in-the-executor design would be > > > > simpler if a CachedPlan contained the plan for a single query rather > > > > than a list of queries, as previously mentioned. This is particularly > > > > due to the requirements of the PORTAL_MULTI_QUERY case. However, this > > > > option might be impractical. > > > > > > It might be, but maybe it would be worth a try? I mean, > > > GetCachedPlan() seems to just call pg_plan_queries() which just loops > > > over the list of query trees and does the same thing for each one. If > > > we wanted to replan a single query, why couldn't we do > > > fake_querytree_list = list_make1(list_nth(querytree_list, n)) and then > > > call pg_plan_queries(fake_querytree_list)? Or something equivalent to > > > that. We could have a new GetCachedSinglePlan(cplan, n) to do this. > > > > I've been hacking to prototype this, and it's showing promise. It > > helps make the replan loop at the call sites that start the executor > > with an invalidatable plan more localized and less prone to > > action-at-a-distance issues. However, the interface and contract of > > the new function in my prototype are pretty specialized for the replan > > loop in this context—meaning it's not as general-purpose as > > GetCachedPlan(). Essentially, what you get when you call it is a > > 'throwaway' CachedPlan containing only the plan for the query that > > failed during ExecutorStart(), not a plan integrated into the original > > CachedPlanSource's stmt_list. A call site entering the replan loop > > will retry the execution with that throwaway plan, release it once > > done, and resume looping over the plans in the original list. The > > invalid plan that remains in the original list will be discarded and > > replanned in the next call to GetCachedPlan() using the same > > CachedPlanSource. While that may sound undesirable, I'm inclined to > > think it's not something that needs optimization, given that we're > > expecting this code path to be taken rarely. > > > > I'll post a version of a revamped locks-in-the-executor patch set > > using the above function after debugging some more. > > Here it is. > > 0001 implements changes to defer the locking of runtime-prunable > relations to the executor. The new design introduces a bitmapset > field in PlannedStmt to distinguish at runtime between relations that > are prunable whose locking can be deferred until ExecInitNode() and > those that are not and must be locked in advance. The set of prunable > relations can be constructed by looking at all the PartitionPruneInfos > in the plan and checking which are subject to "initial" pruning steps. > The set of unprunable relations is obtained by subtracting those from > the set of all RT indexes. This design gets rid of one annoying > aspect of the old design which was the need to add specialized fields > to store the RT indexes of partitioned relations that are not > otherwise referenced in the plan tree. That was necessary because in > the old design, I had removed the function AcquireExecutorLocks() > altogether to defer the locking of all child relations to execution. 
> In the new design such relations are still locked by > AcquireExecutorLocks(). > > 0002 is the old patch to make ExecEndNode() robust against partially > initialized PlanState nodes by adding NULL checks. > > 0003 is the patch to add changes to deal with the CachedPlan becoming > invalid before the deferred locks on prunable relations are taken. > I've moved the replan loop into a new wrapper-over-ExecutorStart() > function instead of having the same logic at multiple sites. The > replan logic uses the GetSingleCachedPlan() described in the quoted > text. The callers of the new ExecutorStart()-wrapper, which I've > dubbed ExecutorStartExt(), need to pass the CachedPlanSource and a > query_index, which is the index of the query being executed in the > list CachedPlanSource.query_list. They are needed by > GetSingleCachedPlan(). The changes outside the executor are pretty > minimal in this design and all the difficulties of having to loop back > to GetCachedPlan() are now gone. I like how this turned out. > > One idea that I think might be worth trying to reduce the footprint of > 0003 is to try to lock the prunable relations in a step of InitPlan() > separate from ExecInitNode(), which can be implemented by doing the > initial runtime pruning in that separate step. That way, we'll have > all the necessary locks before calling ExecInitNode() and so we don't > need to sprinkle the CachedPlanStillValid() checks all over the place > and worry about missed checks and dealing with partially initialized > PlanState trees. > > -- > Thanks, Amit Langote @@ -1241,7 +1244,7 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (customplan) { /* Build a custom plan */ - plan = BuildCachedPlan(plansource, qlist, boundParams, queryEnv); + plan = BuildCachedPlan(plansource, qlist, boundParams, queryEnv, true); Is the *true* here a typo? Seems it should be *false* for custom plan? -- Regards Junwang Zhao
On Sat, Aug 31, 2024 at 9:30 PM Junwang Zhao <zhjwpku@gmail.com> wrote: > @@ -1241,7 +1244,7 @@ GetCachedPlan(CachedPlanSource *plansource, > ParamListInfo boundParams, > if (customplan) > { > /* Build a custom plan */ > - plan = BuildCachedPlan(plansource, qlist, boundParams, queryEnv); > + plan = BuildCachedPlan(plansource, qlist, boundParams, queryEnv, true); > > Is the *true* here a typo? Seems it should be *false* for custom plan? That's correct, thanks for catching that. Will fix. -- Thanks, Amit Langote
Hi Amit, This is not a full review (sorry!) but here are a few comments. In general, I don't have a problem with this direction. I thought Tom's previous proposal of abandoning ExecInitNode() in medias res if we discover that we need to replan was doable and I still think that, but ISTM that this approach needs to touch less code, because abandoning ExecInitNode() partly through means we could have leftover state to clean up in any node in the PlanState tree, and as we've discussed, ExecEndNode() isn't necessarily prepared to clean up a PlanState tree that was only partially processed by ExecInitNode(). As far as I can see in the time I've spent looking at this today, 0001 looks pretty unobjectionable (with some exceptions that I've noted below). I also think 0003 looks pretty safe. It seems like partition pruning moves backward across a pretty modest amount of code that does pretty well-defined things. Basically, initialization-time pruning now happens before other types of node initialization, and before setting up row marks. I do however find the changes in 0002 to be less obviously correct and less obviously safe; see below for some notes about that. In 0001, the name root_parent_relids doesn't seem very clear to me, and neither does the explanation of what it does. You say "'root_parent_relids' identifies the relation to which both the parent plan and the PartitionPruneInfo given by 'part_prune_index' belong." But it's a set, so what does it mean to identify "the" relation? It's a set of relations, not just one. And why does the name include the word "root"? It's neither the PlannerGlobal object, which we often call root, nor is it the root of the partitioning hierarchy. To me, it looks like it's just the set of relids that we can potentially prune. I don't see why this isn't just called "relids", like the field from which it's copied: + pruneinfo->root_parent_relids = parentrel->relids; It just doesn't seem very root-y or very parent-y. - node->part_prune_info = partpruneinfo; + Extra blank line. In 0002, the handling of ExprContexts seems a little bit hard to understand. Sometimes we're using the PlanState's ExprContext, and sometimes we're using a separate context owned by the PartitionedRelPruningData's context, and it's not exactly clear why that is or what the consequences are. Likewise I wouldn't mind some more comments or explanation in the commit message of the changes in this patch related to EState objects. I can't help wondering if the changes here could have either semantic implications (like expression evaluation can produce different results than before) or performance implications (because we create objects that we didn't previously create). As noted above, this is really my only design-level concern about 0001-0003. Typo: partrtitioned Regrettably, I have not looked seriously at 0004 and 0005, so I can't comment on those. -- Robert Haas EDB: http://www.enterprisedb.com
Robert, On Fri, Oct 11, 2024 at 5:15 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Hi Amit, > > This is not a full review (sorry!) but here are a few comments. Thank you for taking a look. > In general, I don't have a problem with this direction. I thought > Tom's previous proposal of abandoning ExecInitNode() in medias res if > we discover that we need to replan was doable and I still think that, > but ISTM that this approach needs to touch less code, because > abandoning ExecInitNode() partly through means we could have leftover > state to clean up in any node in the PlanState tree, and as we've > discussed, ExecEndNode() isn't necessarily prepared to clean up a > PlanState tree that was only partially processed by ExecInitNode(). I will say that I feel more comfortable committing and be responsible for the refactoring I'm proposing in 0001-0003 than the changes required to take locks during ExecInitNode(), as seen in the patches up to version v52.. > As > far as I can see in the time I've spent looking at this today, 0001 > looks pretty unobjectionable (with some exceptions that I've noted > below). I also think 0003 looks pretty safe. It seems like partition > pruning moves backward across a pretty modest amount of code that does > pretty well-defined things. Basically, initialization-time pruning now > happens before other types of node initialization, and before setting > up row marks. I do however find the changes in 0002 to be less > obviously correct and less obviously safe; see below for some notes > about that. > > In 0001, the name root_parent_relids doesn't seem very clear to me, > and neither does the explanation of what it does. You say > "'root_parent_relids' identifies the relation to which both the parent > plan and the PartitionPruneInfo given by 'part_prune_index' belong." > But it's a set, so what does it mean to identify "the" relation? It's > a set of relations, not just one. The intention is to ensure that the bitmapset in PartitionPruneInfo corresponds to the apprelids bitmapset in the Append or MergeAppend node that owns the PartitionPruneInfo. Essentially, root_parent_relids is used to cross-check that both sets align, ensuring that the pruning logic applies to the same relations as the parent plan. > And why does the name include the > word "root"? It's neither the PlannerGlobal object, which we often > call root, nor is it the root of the partitioning hierarchy. To me, it > looks like it's just the set of relids that we can potentially prune. > I don't see why this isn't just called "relids", like the field from > which it's copied: > > + pruneinfo->root_parent_relids = parentrel->relids; > > It just doesn't seem very root-y or very parent-y. Maybe just "relids" suffices with a comment updated like this: * relids RelOptInfo.relids of the parent plan node (e.g. Append * or MergeAppend) to which his PartitionPruneInfo node * belongs. Used to ensure that the pruning logic matches * the parent plan's apprelids. > - node->part_prune_info = partpruneinfo; > + > > Extra blank line. Fixed. > In 0002, the handling of ExprContexts seems a little bit hard to > understand. Sometimes we're using the PlanState's ExprContext, and > sometimes we're using a separate context owned by the > PartitionedRelPruningData's context, and it's not exactly clear why > that is or what the consequences are. Likewise I wouldn't mind some > more comments or explanation in the commit message of the changes in > this patch related to EState objects. 
> I can't help wondering if the > changes here could have either semantic implications (like expression > evaluation can produce different results than before) or performance > implications (because we create objects that we didn't previously > create). I have taken another look at whether there's any real need to use separate ExprContexts for initial and runtime pruning and ISTM there isn't, so we can make "exec" pruning use the same ExprContext as what "init" would have used. There *is* a difference, however, in how we initialize the partition key expressions for initial and runtime pruning, but it's not problematic to use the same ExprContext. I'll update the commentary a bit more. > Typo: partrtitioned Fixed. > Regrettably, I have not looked seriously at 0004 and 0005, so I can't > comment on those. Ok, I'm updating 0005 to change how the CachedPlan is handled when it becomes invalid during InitPlan(). Currently (v56), a separate transient CachedPlan is created for the query being initialized when invalidation occurs. However, it seems better to update the original CachedPlan in place to avoid extra bookkeeping for transient plans—an approach Robert suggested in an off-list discussion. Will post a new version next week. -- Thanks, Amit Langote
On Fri, Oct 11, 2024 at 3:30 AM Amit Langote <amitlangote09@gmail.com> wrote: > Maybe just "relids" suffices with a comment updated like this: > > * relids RelOptInfo.relids of the parent plan node (e.g. Append > * or MergeAppend) to which his PartitionPruneInfo node > * belongs. Used to ensure that the pruning logic matches > * the parent plan's apprelids. LGTM. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 11, 2024 at 3:30 AM Amit Langote <amitlangote09@gmail.com> wrote: >> Maybe just "relids" suffices with a comment updated like this: >> >> * relids RelOptInfo.relids of the parent plan node (e.g. Append >> * or MergeAppend) to which his PartitionPruneInfo node >> * belongs. Used to ensure that the pruning logic matches >> * the parent plan's apprelids. > LGTM. "his" -> "this", surely? regards, tom lane
Hi Tomas, On Mon, Dec 2, 2024 at 3:36 AM Tomas Vondra <tomas@vondra.me> wrote: > Hi, > > I took a look at this patch, mostly to familiarize myself with the > pruning etc. I have a bunch of comments, but all of that is minor, > perhaps even nitpicking - with prior feedback from David, Tom and > Robert, I can't really compete with that. Thanks for looking at this. These are helpful. > FWIW the patch needs a rebase, there's a minor bitrot - but it was > simply enough to fix for a review / testing. > > > 0001 > ---- > > 1) But if we don't expect this error to actually happen, do we really > need to make it ereport()? Maybe it should be plain elog(). I mean, it's > "can't happen" and thus doesn't need translations etc. > > if (!bms_equal(relids, pruneinfo->relids)) > ereport(ERROR, > errcode(ERRCODE_INTERNAL_ERROR), > errmsg_internal("mismatching PartitionPruneInfo found at > part_prune_index %d", > part_prune_index), > errdetail_internal("plan node relids %s, pruneinfo > relids %s", > bmsToString(relids), > bmsToString(pruneinfo->relids))); I'm fine with elog() here even if it causes the message to be longer: elog(ERROR, "mismatching PartitionPruneInfo found at part_prune_index %d (plan node relids %s, pruneinfo relids %s) > Perhaps it should even be an assert? I am not sure about that. Having a message handy might be good if a user ends up hitting this case for whatever reason, like trying to run a corrupted plan. > 2) unnecessary newline added to execPartition.h Perhaps you meant "removed". Fixed. > 3) this comment in EState doesn't seem very helpful > > List *es_part_prune_infos; /* PlannedStmt.partPruneInfos */ Agreed, fixed to be like the comment for es_rteperminfos: List *es_part_prune_infos; /* List of PartitionPruneInfo */ > 5) PlannerGlobal > > /* List of PartitionPruneInfo contained in the plan */ > List *partPruneInfos; > > Why does this say "contained in the plan" unlike the other fields? Is > there some sort of difference? I'm not saying it's wrong. Ok, maybe the following is a bit more helpful and like the comment for other fields: /* "flat" list of PartitionPruneInfos */ List *partPruneInfos; > 0002 > ---- > > 1) Isn't it weird/undesirable partkey_datum_from_expr() loses some of > the asserts? Would the assert be incorrect in the new implementation, or > are we removing it simply because we happen to not have one of the fields? The former -- the asserts would be incorrect in the new implementation -- because in the new implementation a standalone ExprContext is used that is independent of the parent PlanState (when available) for both types of runtime pruning. The old asserts, particularly the second one, weren't asserting something very useful anyway, IMO. What I mean is that the ExprContext provided in the PartitionPruneContext to be the same as the parent PlanState's ps_ExprContext isn't critical to the code that follows. Nor whether the PlanState is available or not. > 2) inconsistent spelling: run-time vs. runtime I assume you meant in this comment: * estate The EState for the query doing runtime pruning Fixed by using run-time, which is a more commonly used term in the source code than runtime. > 3) PartitionPruneContext.is_valid - I think I'd rename the flag to > "initialized" or something like that. The "is_valid" is a bit confusing, > because it might seem the context can get invalidated later, but AFAICS > that's not the case - we just initialize it lazily. Agree that "initialized" is better, so renamed. 
> 0003 > ---- > > 1) In InitPlan I'd move > > estate->es_part_prune_infos = plannedstmt->partPruneInfos; > > before the comment, which is more about ExecDoInitialPruning. Makes sense, done. > 2) I'm not quite sure what "exec" partition pruning is? > > /* > * ExecInitPartitionPruning > * Initialize the data structures needed for runtime "exec" partition > * pruning and return the result of initial pruning, if available. > > Is that the same thing as "runtime pruning"? "Exec" pruning refers to pruning performed during execution, using PARAM_EXEC parameters. In contrast, "init" pruning occurs during plan initialization, using parameters whose values remain constant during execution, such as PARAM_EXTERN parameters and stable functions. Before this patch, the ExecInitPartitionPruning function, called during ExecutorStart(), performed "init" pruning and set up state in the PartitionPruneState for subsequent "exec" pruning during ExecutorRun(). With this patch, "init" pruning is performed well before this function is called, leaving its sole responsibility to setting up the state for "exec" pruning. It may be worth renaming the function to better reflect this new role, rather than updating only the comment. Actually, that is what I decided to do in the attached, along with some other adjustments like moving ExecDoInitialPruning() to execPartition.c from execMain.c, fixing up some obsolete comments, etc. > 0004 > ---- > > 1) typo: paraller/parallel Oops, fixed. > 2) What about adding an assert to ExecFindMatchingSubPlans, to check > valisubplan_rtis is not NULL? It's just mentioned in a comment, but > better to explicitly enforce that? Good idea, done. > > 2) It may not be quite clear why ExecInitUpdateProjection() switches to > mt_updateColnosLists. Should that be explained in a comment, somewhere? There is a comment in the ModifyTableState struct definition: /* * List of valid updateColnosLists. Contains only those belonging to * unpruned relations from ModifyTable.updateColnosLists. */ List *mt_updateColnosLists; It seems redundant to reiterate this in ExecInitUpdateProjection(). > 3) unnecessary newline in ExecLookupResultRelByOid Removed. > 0005 > ---- > > 1) auto_explain.c - So what happens if the plan gets invalidated? The > hook explain_ExecutorStart returns early, but then what? Does that break > the user session somehow, or what? It will get called again after ExecutorStartExt() loops back to do ExecutorStart() with a new updated plan tree. > 2) Isn't it a bit fragile if this requires every extension to update > and add the ExecPlanStillValid() calls to various places? The ExecPlanStillValid() call only needs to be added immediately after the call to standard_ExecutorStart() in an extension's ExecutorStart_hook() implementation. > What if an > extension doesn't do that? What weirdness will happen? The QueryDesc.planstate won't contain a PlanState tree for starters and other state information that InitPlan() populates in EState based on the PlannedStmt. > Maybe it'd be > possible to at least check this in some other executor hook? Or at least > we could ensure the check was done in assert-enabled builds? Or > something to make extension authors aware of this? I've added a note in the commit message, but if that's not enough, one idea might be to change the return type of ExecutorStart_hook so that the extensions that implement it are forced to be adjusted. Say, from void to bool to indicate whether standard_ExecutorStart() succeeded and thus created a "valid" plan. 
I had that in the previous versions of the patch. Thoughts? > Aside from going through the patches, I did a simple benchmark to see > how this works in practice. I did a simple test, with pgbench -S and > variable number of partitions/clients. I also varied the number of locks > per transaction, because I was wondering if it may interact with the > fast-path improvements. See the attached xeon.sh script and CSV with > results from the 44/88-core machine. > > There's also two PDFs visualizing the results, to show the impact as a > difference between "master" (no patches) vs. "pruning" build with v57 > applied. As usual, "green" is good (faster), read is "bad" (slower). > > For most combinations of parameters, there's no impact on throughput. > Anything in 99-101% is just regular noise, possibly even more. I'm > trying to reduce the noise a bit more, but this seems acceptable. I'd > like to discuss three "cases" I see in the results: Thanks for doing these benchmarks. I'll reply separately to discuss the individual cases. > costing / auto mode > ------------------- > > Anyway, this leads me to a related question - not quite a "bug" in the > patch, but something to perhaps think about. And that's costing, and > what "auto" should do. > > There are two PNG charts, showing throughput for runs with -M prepared > and 1000 partitions. Each chart shows throughput for the three cache > modes, and different client counts. There's a clear distinction between > "master" and "patched" runs - the "generic" plans performed terribly, by > orders of magnitude. With the patches it beats the "custom" plans. > > Which is great! But it also means that while "auto" used to do the right > thing, with the patches that's not the case. > > AFAIK that's because we don't consider the runtime pruning when costing > the plans, so the cost is calculated as if no pruning happened. And so > it seems way more expensive than it should ... and it loses with the > custom scans. Is that correct, or do I understand this wrong? That's correct. The planner does not consider runtime pruning when assigning costs to Append or MergeAppend paths in create_{merge}append_path(). > Just to be clear, I'm not claiming the patch has to deal with this. I > suppose it can be handled as a future improvement, and I'm not even sure > there's a good way to consider this during costing. For example, can we > estimate how many partitions will be pruned? There have been discussions about this in the 2017 development thread of run-time pruning [1] and likely at some later point in other threads. One simple approach mentioned at [1] is to consider that only 1 partition will be scanned for queries containing WHERE partkey = $1, because only 1 partition can contain matching rows with that condition. I agree that this should be dealt with sooner than later so users get generic plans even without having to use force_generic_plan. I'll post the updated patches tomorrow. -- Thanks, Amit Langote [1] https://www.postgresql.org/message-id/CA%2BTgmoZv8sd9cKyYtHwmd_13%2BBAjkVKo%3DECe7G98tBK5Ejwatw%40mail.gmail.com
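A small sketch of the "init" vs. "exec" pruning distinction described above, assuming EXPLAIN output of roughly the usual shape (names invented):

create table pt (a int) partition by list (a);
create table pt1 partition of pt for values in (1);
create table pt2 partition of pt for values in (2);

set plan_cache_mode = force_generic_plan;
prepare q (int) as select * from pt where a = $1;
explain (analyze, costs off, timing off, summary off) execute q (1);
-- "init" pruning: $1 is known at executor startup, so the generic plan is
-- expected to report something like "Subplans Removed: 1"

explain (analyze, costs off, timing off, summary off)
    select * from pt where a = (select 1);
-- "exec" pruning: the comparison value comes from a PARAM_EXEC parameter
-- computed at run time, so the pruned partition's subplan is expected to show
-- up as "(never executed)" instead.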
On 12/4/24 14:34, Amit Langote wrote: > Hi Tomas, > > On Mon, Dec 2, 2024 at 3:36 AM Tomas Vondra <tomas@vondra.me> wrote: >> Hi, >> >> I took a look at this patch, mostly to familiarize myself with the >> pruning etc. I have a bunch of comments, but all of that is minor, >> perhaps even nitpicking - with prior feedback from David, Tom and >> Robert, I can't really compete with that. > > Thanks for looking at this. These are helpful. > >> FWIW the patch needs a rebase, there's a minor bitrot - but it was >> simply enough to fix for a review / testing. >> >> >> 0001 >> ---- >> >> 1) But if we don't expect this error to actually happen, do we really >> need to make it ereport()? Maybe it should be plain elog(). I mean, it's >> "can't happen" and thus doesn't need translations etc. >> >> if (!bms_equal(relids, pruneinfo->relids)) >> ereport(ERROR, >> errcode(ERRCODE_INTERNAL_ERROR), >> errmsg_internal("mismatching PartitionPruneInfo found at >> part_prune_index %d", >> part_prune_index), >> errdetail_internal("plan node relids %s, pruneinfo >> relids %s", >> bmsToString(relids), >> bmsToString(pruneinfo->relids))); > > I'm fine with elog() here even if it causes the message to be longer: > > elog(ERROR, "mismatching PartitionPruneInfo found at part_prune_index > %d (plan node relids %s, pruneinfo relids %s) > I'm not forcing you to do elog, if you think ereport() is better. I'm only asking because AFAIK the "policy" is that ereport is for cases that think can happen (and thus get translated), while elog(ERROR) is for cases that we believe shouldn't happen. So every time I see "ereport" I ask myself "how could this happen" which doesn't seem to be the case here. >> Perhaps it should even be an assert? > > I am not sure about that. Having a message handy might be good if a > user ends up hitting this case for whatever reason, like trying to run > a corrupted plan. > I'm a bit skeptical about this, TBH. If we assume the plan is "corrupted", why should we notice in this particular place? I mean, it could be corrupted in a million different ways, and the chance that it got through all the earlier steps is like 1 in a 1.000.000. >> 2) unnecessary newline added to execPartition.h > > Perhaps you meant "removed". Fixed. > Yes, sorry. I misread the diff. >> 5) PlannerGlobal >> >> /* List of PartitionPruneInfo contained in the plan */ >> List *partPruneInfos; >> >> Why does this say "contained in the plan" unlike the other fields? Is >> there some sort of difference? I'm not saying it's wrong. > > Ok, maybe the following is a bit more helpful and like the comment for > other fields: > > /* "flat" list of PartitionPruneInfos */ > List *partPruneInfos; > WFM >> 0002 >> ---- >> >> 1) Isn't it weird/undesirable partkey_datum_from_expr() loses some of >> the asserts? Would the assert be incorrect in the new implementation, or >> are we removing it simply because we happen to not have one of the fields? > > The former -- the asserts would be incorrect in the new implementation > -- because in the new implementation a standalone ExprContext is used > that is independent of the parent PlanState (when available) for both > types of runtime pruning. > > The old asserts, particularly the second one, weren't asserting > something very useful anyway, IMO. What I mean is that the > ExprContext provided in the PartitionPruneContext to be the same as > the parent PlanState's ps_ExprContext isn't critical to the code that > follows. Nor whether the PlanState is available or not. 
> OK, thanks for explaining >> 2) inconsistent spelling: run-time vs. runtime > > I assume you meant in this comment: > > * estate The EState for the query doing runtime pruning > > Fixed by using run-time, which is a more commonly used term in the > source code than runtime. > Not quite. I was looking at runtime/run-time in the patch files, but now I realize some of that is preexisting ... Still, maybe the patch should stick to one spelling. >> 2) I'm not quite sure what "exec" partition pruning is? >> >> /* >> * ExecInitPartitionPruning >> * Initialize the data structures needed for runtime "exec" partition >> * pruning and return the result of initial pruning, if available. >> >> Is that the same thing as "runtime pruning"? > > "Exec" pruning refers to pruning performed during execution, using > PARAM_EXEC parameters. In contrast, "init" pruning occurs during plan > initialization, using parameters whose values remain constant during > execution, such as PARAM_EXTERN parameters and stable functions. > > Before this patch, the ExecInitPartitionPruning function, called > during ExecutorStart(), performed "init" pruning and set up state in > the PartitionPruneState for subsequent "exec" pruning during > ExecutorRun(). With this patch, "init" pruning is performed well > before this function is called, leaving its sole responsibility to > setting up the state for "exec" pruning. It may be worth renaming the > function to better reflect this new role, rather than updating only > the comment. > > Actually, that is what I decided to do in the attached, along with > some other adjustments like moving ExecDoInitialPruning() to > execPartition.c from execMain.c, fixing up some obsolete comments, > etc. > I don't see any attachment :-( Anyway, if I understand correctly, the "runtime pruning" has two separate cases - initial pruning and exec pruning. Is that right? > >> >> 2) It may not be quite clear why ExecInitUpdateProjection() switches to >> mt_updateColnosLists. Should that be explained in a comment, somewhere? > > There is a comment in the ModifyTableState struct definition: > > /* > * List of valid updateColnosLists. Contains only those belonging to > * unpruned relations from ModifyTable.updateColnosLists. > */ > List *mt_updateColnosLists; > > It seems redundant to reiterate this in ExecInitUpdateProjection(). > Ah, I see. Makes sense. > >> 0005 >> ---- >> >> 1) auto_explain.c - So what happens if the plan gets invalidated? The >> hook explain_ExecutorStart returns early, but then what? Does that break >> the user session somehow, or what? > > It will get called again after ExecutorStartExt() loops back to do > ExecutorStart() with a new updated plan tree. > >> 2) Isn't it a bit fragile if this requires every extension to update >> and add the ExecPlanStillValid() calls to various places? > > The ExecPlanStillValid() call only needs to be added immediately after > the call to standard_ExecutorStart() in an extension's > ExecutorStart_hook() implementation. > >> What if an >> extension doesn't do that? What weirdness will happen? > > The QueryDesc.planstate won't contain a PlanState tree for starters > and other state information that InitPlan() populates in EState based > on the PlannedStmt. > OK, and the consequence is that the query will fail, right? >> Maybe it'd be >> possible to at least check this in some other executor hook? Or at least >> we could ensure the check was done in assert-enabled builds? Or >> something to make extension authors aware of this? 
> > I've added a note in the commit message, but if that's not enough, one > idea might be to change the return type of ExecutorStart_hook so that > the extensions that implement it are forced to be adjusted. Say, from > void to bool to indicate whether standard_ExecutorStart() succeeded > and thus created a "valid" plan. I had that in the previous versions > of the patch. Thoughts? > Maybe. My concern is that this case (plan getting invalidated) is fairly rare, so it's entirely plausible the extension will seem to work just fine without the code update for a long time. Sure, changing the APIs is allowed, I'm just wondering if maybe there might be a way to not have this issue, or at least notice the missing call early. I haven't tried, wouldn't it be better to modify ExecutorStart() to do the retries internally? I mean, the extensions wouldn't need to check if the plan is still valid, ExecutorStart() would take care of that. Yeah, it might need some new arguments, but that's more obvious. >> Aside from going through the patches, I did a simple benchmark to see >> how this works in practice. I did a simple test, with pgbench -S and >> variable number of partitions/clients. I also varied the number of locks >> per transaction, because I was wondering if it may interact with the >> fast-path improvements. See the attached xeon.sh script and CSV with >> results from the 44/88-core machine. >> >> There's also two PDFs visualizing the results, to show the impact as a >> difference between "master" (no patches) vs. "pruning" build with v57 >> applied. As usual, "green" is good (faster), read is "bad" (slower). >> >> For most combinations of parameters, there's no impact on throughput. >> Anything in 99-101% is just regular noise, possibly even more. I'm >> trying to reduce the noise a bit more, but this seems acceptable. I'd >> like to discuss three "cases" I see in the results: > > Thanks for doing these benchmarks. I'll reply separately to discuss > the individual cases. > >> costing / auto mode >> ------------------- >> >> Anyway, this leads me to a related question - not quite a "bug" in the >> patch, but something to perhaps think about. And that's costing, and >> what "auto" should do. >> >> There are two PNG charts, showing throughput for runs with -M prepared >> and 1000 partitions. Each chart shows throughput for the three cache >> modes, and different client counts. There's a clear distinction between >> "master" and "patched" runs - the "generic" plans performed terribly, by >> orders of magnitude. With the patches it beats the "custom" plans. >> >> Which is great! But it also means that while "auto" used to do the right >> thing, with the patches that's not the case. >> >> AFAIK that's because we don't consider the runtime pruning when costing >> the plans, so the cost is calculated as if no pruning happened. And so >> it seems way more expensive than it should ... and it loses with the >> custom scans. Is that correct, or do I understand this wrong? > > That's correct. The planner does not consider runtime pruning when > assigning costs to Append or MergeAppend paths in > create_{merge}append_path(). > >> Just to be clear, I'm not claiming the patch has to deal with this. I >> suppose it can be handled as a future improvement, and I'm not even sure >> there's a good way to consider this during costing. For example, can we >> estimate how many partitions will be pruned? 
> > There have been discussions about this in the 2017 development thread > of run-time pruning [1] and likely at some later point in other > threads. One simple approach mentioned at [1] is to consider that > only 1 partition will be scanned for queries containing WHERE partkey > = $1, because only 1 partition can contain matching rows with that > condition. > > I agree that this should be dealt with sooner than later so users get > generic plans even without having to use force_generic_plan. > > I'll post the updated patches tomorrow. > Cool, thanks! regards -- Tomas Vondra
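As an aside, for readers trying to keep the two kinds of run-time pruning apart in the exchange above: the split is driven by two separate lists of pruning steps that the planner emits. The sketch below only illustrates that shape; the struct name is made up for illustration, though fields with these roles exist in PartitionedRelPruneInfo (plannodes.h).

    #include "postgres.h"

    #include "nodes/bitmapset.h"
    #include "nodes/pg_list.h"

    /*
     * Illustration only (not the real node): the two flavors of run-time
     * pruning are driven by separate step lists.
     */
    typedef struct PruningStepsSketch
    {
        /*
         * "init" pruning: steps whose comparison values are fixed for the
         * whole execution -- constants, PARAM_EXTERN parameters, stable
         * functions.  Under this patch series they are evaluated once,
         * before any plan nodes are initialized, which is what allows
         * locking of the pruned partitions to be skipped.
         */
        List       *initial_pruning_steps;

        /*
         * "exec" pruning: steps that depend on PARAM_EXEC values, e.g.
         * values supplied by the outer side of a nested loop, so they can
         * only be evaluated, possibly repeatedly, during ExecutorRun().
         */
        List       *exec_pruning_steps;

        /* PARAM_EXEC parameter IDs the exec steps depend on */
        Bitmapset  *execparamids;
    } PruningStepsSketch;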
Tomas Vondra <tomas@vondra.me> writes: > I'm not forcing you to do elog, if you think ereport() is better. I'm > only asking because AFAIK the "policy" is that ereport is for cases that > think can happen (and thus get translated), while elog(ERROR) is for > cases that we believe shouldn't happen. The proposed coding looks fine from that perspective, because it uses errmsg_internal and errdetail_internal which don't give rise to translatable strings. Having said that, if we think this is a "can't happen" case then it's fair to wonder why go to such lengths to format it prettily. Also, I'd argue that the error message style guidelines still apply, but this errdetail doesn't conform. regards, tom lane
On Thu, Dec 5, 2024 at 2:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Tomas Vondra <tomas@vondra.me> writes: > > I'm not forcing you to do elog, if you think ereport() is better. I'm > > only asking because AFAIK the "policy" is that ereport is for cases that > > think can happen (and thus get translated), while elog(ERROR) is for > > cases that we believe shouldn't happen. > > The proposed coding looks fine from that perspective, because it uses > errmsg_internal and errdetail_internal which don't give rise to > translatable strings. Having said that, if we think this is a > "can't happen" case then it's fair to wonder why go to such lengths > to format it prettily. Also, I'd argue that the error message > style guidelines still apply, but this errdetail doesn't conform. Thinking about this further, perhaps an Assert is sufficient here. An Append/MergeAppend node's part_prune_index not pointing to the correct entry in the global "flat" list of PartitionPruneInfos would indicate a bug. It seems unlikely that user actions could cause this issue. -- Thanks, Amit Langote
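For comparison, here is a sketch of the three styles being weighed, built from the hunk quoted earlier in the thread; the wrapper function is hypothetical and exists only to make the fragment self-contained, and the message text is the wording already proposed upthread, not final.

    #include "postgres.h"

    #include "nodes/bitmapset.h"
    #include "nodes/plannodes.h"

    /* Hypothetical wrapper around the cross-check discussed above. */
    static void
    check_pruneinfo_relids(Bitmapset *relids, PartitionPruneInfo *pruneinfo,
                           int part_prune_index)
    {
        /* Style 1: ereport() with the _internal() variants, as in the quoted hunk. */
        if (!bms_equal(relids, pruneinfo->relids))
            ereport(ERROR,
                    errcode(ERRCODE_INTERNAL_ERROR),
                    errmsg_internal("mismatching PartitionPruneInfo found at part_prune_index %d",
                                    part_prune_index),
                    errdetail_internal("plan node relids %s, pruneinfo relids %s",
                                       bmsToString(relids),
                                       bmsToString(pruneinfo->relids)));

        /* Style 2: a plain elog(), completing the one-liner suggested upthread. */
        if (!bms_equal(relids, pruneinfo->relids))
            elog(ERROR, "mismatching PartitionPruneInfo found at part_prune_index %d (plan node relids %s, pruneinfo relids %s)",
                 part_prune_index, bmsToString(relids), bmsToString(pruneinfo->relids));

        /* Style 3: treat a mismatch purely as a planner-bug guard. */
        Assert(bms_equal(relids, pruneinfo->relids));
    }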
On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote: > On 12/4/24 14:34, Amit Langote wrote: > > On Mon, Dec 2, 2024 at 3:36 AM Tomas Vondra <tomas@vondra.me> wrote: > >> 0001 > >> ---- > >> > >> 1) But if we don't expect this error to actually happen, do we really > >> need to make it ereport()? Maybe it should be plain elog(). I mean, it's > >> "can't happen" and thus doesn't need translations etc. > >> > >> if (!bms_equal(relids, pruneinfo->relids)) > >> ereport(ERROR, > >> errcode(ERRCODE_INTERNAL_ERROR), > >> errmsg_internal("mismatching PartitionPruneInfo found at > >> part_prune_index %d", > >> part_prune_index), > >> errdetail_internal("plan node relids %s, pruneinfo > >> relids %s", > >> bmsToString(relids), > >> bmsToString(pruneinfo->relids))); > > > > I'm fine with elog() here even if it causes the message to be longer: > > > > elog(ERROR, "mismatching PartitionPruneInfo found at part_prune_index > > %d (plan node relids %s, pruneinfo relids %s) > > > > I'm not forcing you to do elog, if you think ereport() is better. I'm > only asking because AFAIK the "policy" is that ereport is for cases that > think can happen (and thus get translated), while elog(ERROR) is for > cases that we believe shouldn't happen. > > So every time I see "ereport" I ask myself "how could this happen" which > doesn't seem to be the case here. > > >> Perhaps it should even be an assert? > > > > I am not sure about that. Having a message handy might be good if a > > user ends up hitting this case for whatever reason, like trying to run > > a corrupted plan. > > I'm a bit skeptical about this, TBH. If we assume the plan is > "corrupted", why should we notice in this particular place? I mean, it > could be corrupted in a million different ways, and the chance that it > got through all the earlier steps is like 1 in a 1.000.000. Yeah, I am starting to think the same. Btw, the idea to have a check and elog() / ereport() came from Alvaro upthread: https://www.postgresql.org/message-id/20221130181201.mfinyvtob3j5i2a6%40alvherre.pgsql > >> 2) I'm not quite sure what "exec" partition pruning is? > >> > >> /* > >> * ExecInitPartitionPruning > >> * Initialize the data structures needed for runtime "exec" partition > >> * pruning and return the result of initial pruning, if available. > >> > >> Is that the same thing as "runtime pruning"? > > > > "Exec" pruning refers to pruning performed during execution, using > > PARAM_EXEC parameters. In contrast, "init" pruning occurs during plan > > initialization, using parameters whose values remain constant during > > execution, such as PARAM_EXTERN parameters and stable functions. > > > > Before this patch, the ExecInitPartitionPruning function, called > > during ExecutorStart(), performed "init" pruning and set up state in > > the PartitionPruneState for subsequent "exec" pruning during > > ExecutorRun(). With this patch, "init" pruning is performed well > > before this function is called, leaving its sole responsibility to > > setting up the state for "exec" pruning. It may be worth renaming the > > function to better reflect this new role, rather than updating only > > the comment. > > > > Actually, that is what I decided to do in the attached, along with > > some other adjustments like moving ExecDoInitialPruning() to > > execPartition.c from execMain.c, fixing up some obsolete comments, > > etc. 
> > > > I don't see any attachment :-( > > Anyway, if I understand correctly, the "runtime pruning" has two > separate cases - initial pruning and exec pruning. Is that right? That's correct. These patches are about performing "initial" pruning at a different time and place so that we can take the deferred locks on the unpruned partitions before we perform ExecInitNode() on any of the plan trees in the PlannedStmt. > >> 0005 > >> ---- > >> > >> 1) auto_explain.c - So what happens if the plan gets invalidated? The > >> hook explain_ExecutorStart returns early, but then what? Does that break > >> the user session somehow, or what? > > > > It will get called again after ExecutorStartExt() loops back to do > > ExecutorStart() with a new updated plan tree. > > > >> 2) Isn't it a bit fragile if this requires every extension to update > >> and add the ExecPlanStillValid() calls to various places? > > > > The ExecPlanStillValid() call only needs to be added immediately after > > the call to standard_ExecutorStart() in an extension's > > ExecutorStart_hook() implementation. > > > >> What if an > >> extension doesn't do that? What weirdness will happen? > > > > The QueryDesc.planstate won't contain a PlanState tree for starters > > and other state information that InitPlan() populates in EState based > > on the PlannedStmt. > > OK, and the consequence is that the query will fail, right? No, the core executor will retry the execution with a new updated plan. In the absence of the early return, the extension might even crash when accessing such incomplete QueryDesc. What the patch makes the ExecutorStart_hook do is similar to how InitPlan() will return early when locks taken on partitions that survive initial pruning invalidate the plan. > >> Maybe it'd be > >> possible to at least check this in some other executor hook? Or at least > >> we could ensure the check was done in assert-enabled builds? Or > >> something to make extension authors aware of this? > > > > I've added a note in the commit message, but if that's not enough, one > > idea might be to change the return type of ExecutorStart_hook so that > > the extensions that implement it are forced to be adjusted. Say, from > > void to bool to indicate whether standard_ExecutorStart() succeeded > > and thus created a "valid" plan. I had that in the previous versions > > of the patch. Thoughts? > > Maybe. My concern is that this case (plan getting invalidated) is fairly > rare, so it's entirely plausible the extension will seem to work just > fine without the code update for a long time. You might see the errors like the one below when the core executor or a hook tries to initialize or process in some other way a known invalid plan, for example, because an unpruned partition's index got concurrently dropped before the executor got the lock: ERROR: could not open relation with OID xxx > Sure, changing the APIs is allowed, I'm just wondering if maybe there > might be a way to not have this issue, or at least notice the missing > call early. > > I haven't tried, wouldn't it be better to modify ExecutorStart() to do > the retries internally? I mean, the extensions wouldn't need to check if > the plan is still valid, ExecutorStart() would take care of that. Yeah, > it might need some new arguments, but that's more obvious. One approach could be to move some code from standard_ExecutorStart() into ExecutorStart(). 
Specifically, the code responsible for setting up enough state in the EState to perform ExecDoInitialPruning(), which takes the locks that might invalidate the plan. If the plan does become invalid, the hook and standard_ExecutorStart() are not called. Instead, the caller, ExecutorStartExt() in this case, creates a new plan. This avoids the need to add ExecPlanStillValid() checks anywhere, whether in core or extension code. However, it does mean accessing the PlannedStmt earlier than InitPlan() currently does, but the current placement of that code is not exactly set in stone. -- Thanks, Amit Langote
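A very rough sketch of the restructuring floated above, with the open parts left as comments: nothing here beyond the existing ExecutorStart_hook dispatch is actual code, and in particular how much EState setup ExecDoInitialPruning() would need before it can run is exactly the part that is not settled.

    #include "postgres.h"

    #include "executor/executor.h"

    void
    ExecutorStart(QueryDesc *queryDesc, int eflags)
    {
        /*
         * (1) Create the EState and copy in just enough from
         *     queryDesc->plannedstmt and queryDesc->params for the "initial"
         *     pruning expressions to be evaluated.
         * (2) Run ExecDoInitialPruning(), which locks surviving partitions
         *     and may thereby invalidate the plan.
         * (3) If the plan is now invalid, return without calling the hook or
         *     standard_ExecutorStart(); ExecutorStartExt() builds a new plan.
         * (4) Otherwise proceed as today:
         */
        if (ExecutorStart_hook)
            (*ExecutorStart_hook) (queryDesc, eflags);
        else
            standard_ExecutorStart(queryDesc, eflags);
    }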
On Thu, Dec 5, 2024 at 3:53 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote: > > Sure, changing the APIs is allowed, I'm just wondering if maybe there > > might be a way to not have this issue, or at least notice the missing > > call early. > > > > I haven't tried, wouldn't it be better to modify ExecutorStart() to do > > the retries internally? I mean, the extensions wouldn't need to check if > > the plan is still valid, ExecutorStart() would take care of that. Yeah, > > it might need some new arguments, but that's more obvious. > > One approach could be to move some code from standard_ExecutorStart() > into ExecutorStart(). Specifically, the code responsible for setting > up enough state in the EState to perform ExecDoInitialPruning(), which > takes locks that might invalidate the plan. If the plan does become > invalid, the hook and standard_ExecutorStart() are not called. > Instead, the caller, ExecutorStartExt() in this case, creates a new > plan. > > This avoids the need to add ExecPlanStillValid() checks anywhere, > whether in core or extension code. However, it does mean accessing the > PlannedStmt earlier than InitPlan(), but the current placement of the > code is not exactly set in stone. I tried this approach and found that it essentially disables testing of this patch using the delay_execution module, which relies on the ExecutorStart_hook(). The way the testing works is that the hook in delay_execution.c pauses the execution of a cached plan to allow a concurrent session to drop an index referenced in the plan. When unpaused, execution initialization resumes by calling standard_ExecutorStart(). At this point, obtaining the lock on the partition whose index has been dropped invalidates the plan, which the hook detects and reports. It then also reports the successful re-execution of an updated plan that no longer references the dropped index. Hmm. -- Thanks, Amit Langote
On 12/5/24 07:53, Amit Langote wrote: > On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote: >> ... >> >>>> What if an >>>> extension doesn't do that? What weirdness will happen? >>> >>> The QueryDesc.planstate won't contain a PlanState tree for starters >>> and other state information that InitPlan() populates in EState based >>> on the PlannedStmt. >> >> OK, and the consequence is that the query will fail, right? > > No, the core executor will retry the execution with a new updated > plan. In the absence of the early return, the extension might even > crash when accessing such incomplete QueryDesc. > > What the patch makes the ExecutorStart_hook do is similar to how > InitPlan() will return early when locks taken on partitions that > survive initial pruning invalidate the plan. > Isn't that what I said? My question was what happens if the extension does not add the new ExecPlanStillValid() call - sorry if that wasn't clear. If it can crash, that's what I meant by "fail". >>>> Maybe it'd be >>>> possible to at least check this in some other executor hook? Or at least >>>> we could ensure the check was done in assert-enabled builds? Or >>>> something to make extension authors aware of this? >>> >>> I've added a note in the commit message, but if that's not enough, one >>> idea might be to change the return type of ExecutorStart_hook so that >>> the extensions that implement it are forced to be adjusted. Say, from >>> void to bool to indicate whether standard_ExecutorStart() succeeded >>> and thus created a "valid" plan. I had that in the previous versions >>> of the patch. Thoughts? >> >> Maybe. My concern is that this case (plan getting invalidated) is fairly >> rare, so it's entirely plausible the extension will seem to work just >> fine without the code update for a long time. > > You might see the errors like the one below when the core executor or > a hook tries to initialize or process in some other way a known > invalid plan, for example, because an unpruned partition's index got > concurrently dropped before the executor got the lock: > > ERROR: could not open relation with OID xxx > Yeah, but how likely is that? How often get plans invalidated in regular application workload. People don't create or drop indexes very often, for example ... Again, I'm not saying requiring the call would be unacceptable, I'm sure we made similar changes in the past. But if it wasn't needed without too much contortion, that would be nice. regards -- Tomas Vondra
On 12/5/24 12:28, Amit Langote wrote: > On Thu, Dec 5, 2024 at 3:53 PM Amit Langote <amitlangote09@gmail.com> wrote: >> On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote: >>> Sure, changing the APIs is allowed, I'm just wondering if maybe there >>> might be a way to not have this issue, or at least notice the missing >>> call early. >>> >>> I haven't tried, wouldn't it be better to modify ExecutorStart() to do >>> the retries internally? I mean, the extensions wouldn't need to check if >>> the plan is still valid, ExecutorStart() would take care of that. Yeah, >>> it might need some new arguments, but that's more obvious. >> >> One approach could be to move some code from standard_ExecutorStart() >> into ExecutorStart(). Specifically, the code responsible for setting >> up enough state in the EState to perform ExecDoInitialPruning(), which >> takes locks that might invalidate the plan. If the plan does become >> invalid, the hook and standard_ExecutorStart() are not called. >> Instead, the caller, ExecutorStartExt() in this case, creates a new >> plan. >> >> This avoids the need to add ExecPlanStillValid() checks anywhere, >> whether in core or extension code. However, it does mean accessing the >> PlannedStmt earlier than InitPlan(), but the current placement of the >> code is not exactly set in stone. > > I tried this approach and found that it essentially disables testing > of this patch using the delay_execution module, which relies on the > ExecutorStart_hook(). The way the testing works is that the hook in > delay_execution.c pauses the execution of a cached plan to allow a > concurrent session to drop an index referenced in the plan. When > unpaused, execution initialization resumes by calling > standard_ExecutorStart(). At this point, obtaining the lock on the > partition whose index has been dropped invalidates the plan, which the > hook detects and reports. It then also reports the successful > re-execution of an updated plan that no longer references the dropped > index. Hmm. > It's not clear to me why the change disables this testing, and I can't try without a patch. Could you explain? thanks -- Tomas Vondra
On Thu, Dec 5, 2024 at 10:53 PM Tomas Vondra <tomas@vondra.me> wrote: > On 12/5/24 07:53, Amit Langote wrote: > > On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote: > >> ... > >> > >>>> What if an > >>>> extension doesn't do that? What weirdness will happen? > >>> > >>> The QueryDesc.planstate won't contain a PlanState tree for starters > >>> and other state information that InitPlan() populates in EState based > >>> on the PlannedStmt. > >> > >> OK, and the consequence is that the query will fail, right? > > > > No, the core executor will retry the execution with a new updated > > plan. In the absence of the early return, the extension might even > > crash when accessing such incomplete QueryDesc. > > > > What the patch makes the ExecutorStart_hook do is similar to how > > InitPlan() will return early when locks taken on partitions that > > survive initial pruning invalidate the plan. > > Isn't that what I said? My question was what happens if the extension > does not add the new ExecPlanStillValid() call - sorry if that wasn't > clear. If it can crash, that's what I meant by "fail". Ok, I see. So, I suppose you meant to confirm if the invalid plan won't silently be executed returning wrong results. Yes, I don't think that would happen given the kinds of invalidations that are possible. The various checks in the ExecInitNode() path, such as the one that catches a missing index, will prevent the plan from running. I may not have searched exhaustively enough though. > >>>> Maybe it'd be > >>>> possible to at least check this in some other executor hook? Or at least > >>>> we could ensure the check was done in assert-enabled builds? Or > >>>> something to make extension authors aware of this? > >>> > >>> I've added a note in the commit message, but if that's not enough, one > >>> idea might be to change the return type of ExecutorStart_hook so that > >>> the extensions that implement it are forced to be adjusted. Say, from > >>> void to bool to indicate whether standard_ExecutorStart() succeeded > >>> and thus created a "valid" plan. I had that in the previous versions > >>> of the patch. Thoughts? > >> > >> Maybe. My concern is that this case (plan getting invalidated) is fairly > >> rare, so it's entirely plausible the extension will seem to work just > >> fine without the code update for a long time. > > > > You might see the errors like the one below when the core executor or > > a hook tries to initialize or process in some other way a known > > invalid plan, for example, because an unpruned partition's index got > > concurrently dropped before the executor got the lock: > > > > ERROR: could not open relation with OID xxx > > Yeah, but how likely is that? How often get plans invalidated in regular > application workload. People don't create or drop indexes very often, > for example ... Yeah, that's a valid point. Andres once mentioned that ANALYZE can invalidate plans and that can occur frequently in busy systems. > Again, I'm not saying requiring the call would be unacceptable, I'm sure > we made similar changes in the past. But if it wasn't needed without too > much contortion, that would be nice. I tend to agree. Another change introduced by the patch that extensions might need to mind (noted in the commit message of v58-0004) is the addition of the es_unpruned_relids field to EState. This field tracks the RT indexes of relations that are locked and therefore safe to access during execution. 
Importantly, it does not include the RT indexes of leaf partitions that are pruned during "initial" pruning and thus remain unlocked. This change means that executor extensions can no longer assume that all relations in the range table are locked and safe to access. Instead, extensions must account for the possibility that some relations, specifically pruned partitions, are not locked. Normally, executor code accesses relations using ExecGetRangeTableRelation(), which does not take a lock before returning the Relation pointer, assuming that locks are already managed upstream. -- Thanks, Amit Langote
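To illustrate the kind of adjustment described above, an extension that walks the range table during execution could guard its accesses roughly as below. bms_is_member() and the two-argument ExecGetRangeTableRelation() are long-standing APIs; es_unpruned_relids is the field added by this patch series, and the helper name and its NULL-return convention are purely illustrative.

    #include "postgres.h"

    #include "executor/executor.h"
    #include "nodes/bitmapset.h"
    #include "utils/rel.h"

    /*
     * Illustrative helper: return the opened Relation for a range-table
     * index, or NULL if the RTE belongs to a leaf partition that was pruned
     * during "initial" pruning and therefore was never locked in this
     * execution.
     */
    static Relation
    get_unpruned_relation(EState *estate, Index rti)
    {
        /* Pruned partitions are not members of es_unpruned_relids. */
        if (!bms_is_member(rti, estate->es_unpruned_relids))
            return NULL;

        /*
         * Safe to access: the relation was locked upstream, either by the
         * plan cache machinery or while performing initial pruning.
         */
        return ExecGetRangeTableRelation(estate, rti);
    }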