Thread: generic plans and "initial" pruning

generic plans and "initial" pruning

From
Amit Langote
Date:
Executing generic plans involving partitions is known to become slower
as partition count grows due to a number of bottlenecks, with
AcquireExecutorLocks() showing at the top in profiles.
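
For reference, AcquireExecutorLocks() today loops over each statement's
entire flat range table and takes a lock on every relation RTE, roughly
like the simplified sketch below (the unlock path, utility statements,
and error handling are omitted):

#include "postgres.h"

#include "nodes/parsenodes.h"
#include "nodes/plannodes.h"
#include "storage/lmgr.h"

/* Simplified version of plancache.c's lock-acquisition loop. */
static void
AcquireExecutorLocks(List *stmt_list)
{
	ListCell   *lc1;

	foreach(lc1, stmt_list)
	{
		PlannedStmt *plannedstmt = lfirst_node(PlannedStmt, lc1);
		ListCell   *lc2;

		foreach(lc2, plannedstmt->rtable)
		{
			RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc2);

			if (rte->rtekind != RTE_RELATION)
				continue;

			/*
			 * One lock per relation RTE, whether or not its plan node
			 * could be pruned; with a generic plan over a partitioned
			 * table this is O(number of partitions).
			 */
			LockRelationOid(rte->relid, rte->rellockmode);
		}
	}
}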

Previous attempt at solving that problem was by David Rowley [1],
where he proposed delaying locking of *all* partitions appearing under
an Append/MergeAppend until "initial" pruning is done during the
executor initialization phase.  A problem with that approach that he
has described in [2] is that leaving partitions unlocked can lead to
race conditions where the Plan node belonging to a partition can be
invalidated when a concurrent session successfully alters the
partition between AcquireExecutorLocks() saying the plan is okay to
execute and then actually executing it.

However, using an idea that Robert suggested to me off-list a little
while back, it seems possible to determine the set of partitions that
we can safely skip locking.  The idea is to look at the "initial" or
"pre-execution" pruning instructions contained in a given Append or
MergeAppend node when AcquireExecutorLocks() is collecting the
relations to lock and consider relations from only those sub-nodes
that survive performing those instructions.   I've attempted
implementing that idea in the attached patch.
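
To sketch the idea in code: the walker, the context fields, and the
contains_init_steps flag below loosely follow the patch, while
perform_initial_pruning_steps() is only an illustrative stand-in for
the actual pruning call, and collection of the partitioned parents'
own RT indexes is left out:

#include "postgres.h"

#include "nodes/params.h"
#include "nodes/plannodes.h"

/* Both of these are illustrative declarations, not mainline APIs. */
extern bool plan_tree_walker(Plan *plan, bool (*walker) (), void *context);
extern Bitmapset *perform_initial_pruning_steps(PartitionPruneInfo *pruneinfo,
												ParamListInfo params);

typedef struct GetLockableRelationsContext
{
	PlannedStmt *plannedstmt;
	ParamListInfo params;		/* EXECUTE's bound parameters */
	Bitmapset  *relations;		/* output: RT indexes to lock */
} GetLockableRelationsContext;

static bool
GetLockableRelations_worker(Plan *plan, GetLockableRelationsContext *context)
{
	switch (nodeTag(plan))
	{
		/* T_MergeAppend would be handled the same way */
		case T_Append:
			{
				Append	   *aplan = (Append *) plan;
				PartitionPruneInfo *pruneinfo = aplan->part_prune_info;

				/* contains_init_steps is a flag added by the patch */
				if (pruneinfo && pruneinfo->contains_init_steps)
				{
					Bitmapset  *validsubplans;
					int			i = -1;

					/* run only the steps that need no PARAM_EXEC Params */
					validsubplans =
						perform_initial_pruning_steps(pruneinfo,
													  context->params);

					/* recurse only into the subplans that survive */
					while ((i = bms_next_member(validsubplans, i)) >= 0)
						GetLockableRelations_worker(list_nth(aplan->appendplans, i),
													context);
					return false;	/* children already handled */
				}
			}
			break;

		case T_SeqScan:
		case T_IndexScan:
			context->relations = bms_add_member(context->relations,
												((Scan *) plan)->scanrelid);
			break;

		default:
			break;
	}

	/* everything else: visit all child nodes the usual way */
	return plan_tree_walker(plan, GetLockableRelations_worker, context);
}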

Note that "initial" pruning steps are now performed twice when
executing generic plans: once in AcquireExecutorLocks() to find
partitions to be locked, and a 2nd time in ExecInit[Merge]Append() to
determine the set of partition sub-nodes to be initialized for
execution, though I wasn't able to come up with a good idea to avoid
this duplication.

Using the following benchmark setup:

pgbench testdb -i --partitions=$nparts > /dev/null 2>&1
pgbench -n testdb -S -T 30 -Mprepared

And plan_cache_mode = force_generic_plan,

I get the following numbers:

HEAD:

32      tps = 20561.776403 (without initial connection time)
64      tps = 12553.131423 (without initial connection time)
128     tps = 13330.365696 (without initial connection time)
256     tps = 8605.723120 (without initial connection time)
512     tps = 4435.951139 (without initial connection time)
1024    tps = 2346.902973 (without initial connection time)
2048    tps = 1334.680971 (without initial connection time)

Patched:

32      tps = 27554.156077 (without initial connection time)
64      tps = 27531.161310 (without initial connection time)
128     tps = 27138.305677 (without initial connection time)
256     tps = 25825.467724 (without initial connection time)
512     tps = 19864.386305 (without initial connection time)
1024    tps = 18742.668944 (without initial connection time)
2048    tps = 16312.412704 (without initial connection time)

-- 
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CAKJS1f_kfRQ3ZpjQyHC7=PK9vrhxiHBQFZ+hc0JCwwnRKkF3hg@mail.gmail.com

[2] https://www.postgresql.org/message-id/CAKJS1f99JNe%2Bsw5E3qWmS%2BHeLMFaAhehKO67J1Ym3pXv0XBsxw%40mail.gmail.com

Attachment

Re: generic plans and "initial" pruning

From
Ashutosh Bapat
Date:
On Sat, Dec 25, 2021 at 9:06 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Executing generic plans involving partitions is known to become slower
> as partition count grows due to a number of bottlenecks, with
> AcquireExecutorLocks() showing at the top in profiles.
>
> Previous attempt at solving that problem was by David Rowley [1],
> where he proposed delaying locking of *all* partitions appearing under
> an Append/MergeAppend until "initial" pruning is done during the
> executor initialization phase.  A problem with that approach that he
> has described in [2] is that leaving partitions unlocked can lead to
> race conditions where the Plan node belonging to a partition can be
> invalidated when a concurrent session successfully alters the
> partition between AcquireExecutorLocks() saying the plan is okay to
> execute and then actually executing it.
>
> However, using an idea that Robert suggested to me off-list a little
> while back, it seems possible to determine the set of partitions that
> we can safely skip locking.  The idea is to look at the "initial" or
> "pre-execution" pruning instructions contained in a given Append or
> MergeAppend node when AcquireExecutorLocks() is collecting the
> relations to lock and consider relations from only those sub-nodes
> that survive performing those instructions.   I've attempted
> implementing that idea in the attached patch.
>

In which cases, we will have "pre-execution" pruning instructions that
can be used to skip locking partitions? Can you please give a few
examples where this approach will be useful?

The benchmark is showing good results, indeed.


-- 
Best Wishes,
Ashutosh Bapat



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Dec 28, 2021 at 22:12 Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Sat, Dec 25, 2021 at 9:06 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Executing generic plans involving partitions is known to become slower
> as partition count grows due to a number of bottlenecks, with
> AcquireExecutorLocks() showing at the top in profiles.
>
> Previous attempt at solving that problem was by David Rowley [1],
> where he proposed delaying locking of *all* partitions appearing under
> an Append/MergeAppend until "initial" pruning is done during the
> executor initialization phase.  A problem with that approach that he
> has described in [2] is that leaving partitions unlocked can lead to
> race conditions where the Plan node belonging to a partition can be
> invalidated when a concurrent session successfully alters the
> partition between AcquireExecutorLocks() saying the plan is okay to
> execute and then actually executing it.
>
> However, using an idea that Robert suggested to me off-list a little
> while back, it seems possible to determine the set of partitions that
> we can safely skip locking.  The idea is to look at the "initial" or
> "pre-execution" pruning instructions contained in a given Append or
> MergeAppend node when AcquireExecutorLocks() is collecting the
> relations to lock and consider relations from only those sub-nodes
> that survive performing those instructions.   I've attempted
> implementing that idea in the attached patch.
>

In which cases, we will have "pre-execution" pruning instructions that
can be used to skip locking partitions? Can you please give a few
examples where this approach will be useful?

This is mainly to be useful for prepared queries, so something like:

prepare q as select * from partitioned_table where key = $1;

And that too when execute q(…) uses a generic plan. Generic plans are problematic because they must contain nodes for all partitions (without any plan-time pruning), which means CheckCachedPlan() has to spend time proportional to the number of partitions to determine that the plan is still usable / has not been invalidated; most of that is AcquireExecutorLocks().

Other bottlenecks, not addressed in this patch, pertain to some executor startup/shutdown subroutines that process the range table of a PlannedStmt in its entirety, whose length is also proportional to the number of partitions when the plan is generic.

The benchmark is showing good results, indeed.

Thanks.
--

Re: generic plans and "initial" pruning

From
Amul Sul
Date:
On Fri, Dec 31, 2021 at 7:56 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Tue, Dec 28, 2021 at 22:12 Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>>
>> On Sat, Dec 25, 2021 at 9:06 AM Amit Langote <amitlangote09@gmail.com> wrote:
>> >
>> > Executing generic plans involving partitions is known to become slower
>> > as partition count grows due to a number of bottlenecks, with
>> > AcquireExecutorLocks() showing at the top in profiles.
>> >
>> > Previous attempt at solving that problem was by David Rowley [1],
>> > where he proposed delaying locking of *all* partitions appearing under
>> > an Append/MergeAppend until "initial" pruning is done during the
>> > executor initialization phase.  A problem with that approach that he
>> > has described in [2] is that leaving partitions unlocked can lead to
>> > race conditions where the Plan node belonging to a partition can be
>> > invalidated when a concurrent session successfully alters the
>> > partition between AcquireExecutorLocks() saying the plan is okay to
>> > execute and then actually executing it.
>> >
>> > However, using an idea that Robert suggested to me off-list a little
>> > while back, it seems possible to determine the set of partitions that
>> > we can safely skip locking.  The idea is to look at the "initial" or
>> > "pre-execution" pruning instructions contained in a given Append or
>> > MergeAppend node when AcquireExecutorLocks() is collecting the
>> > relations to lock and consider relations from only those sub-nodes
>> > that survive performing those instructions.   I've attempted
>> > implementing that idea in the attached patch.
>> >
>>
>> In which cases, we will have "pre-execution" pruning instructions that
>> can be used to skip locking partitions? Can you please give a few
>> examples where this approach will be useful?
>
>
> This is mainly to be useful for prepared queries, so something like:
>
> prepare q as select * from partitioned_table where key = $1;
>
> And that too when execute q(…) uses a generic plan. Generic plans are problematic because they must contain nodes for
> all partitions (without any plan-time pruning), which means CheckCachedPlan() has to spend time proportional to the
> number of partitions to determine that the plan is still usable / has not been invalidated; most of that is
> AcquireExecutorLocks().
>
> Other bottlenecks, not addressed in this patch, pertain to some executor startup/shutdown subroutines that process
> the range table of a PlannedStmt in its entirety, whose length is also proportional to the number of partitions when
> the plan is generic.
>
>> The benchmark is showing good results, indeed.
>
Indeed.

Here are few comments for v1 patch:

+   /* Caller error if we get here without contains_init_steps */
+   Assert(pruneinfo->contains_init_steps);

-       prunedata = prunestate->partprunedata[i];
-       pprune = &prunedata->partrelprunedata[0];

-       /* Perform pruning without using PARAM_EXEC Params */
-       find_matching_subplans_recurse(prunedata, pprune, true, &result);
+   if (parentrelids)
+       *parentrelids = NULL;

You got two blank lines after Assert.
--

+   /* Set up EState if not in the executor proper. */
+   if (estate == NULL)
+   {
+       estate = CreateExecutorState();
+       estate->es_param_list_info = params;
+       free_estate = true;
    }

... [Skip]

+   if (free_estate)
+   {
+       FreeExecutorState(estate);
+       estate = NULL;
    }

I think this work should be left to the caller.
--

    /*
     * Stuff that follows matches exactly what ExecCreatePartitionPruneState()
     * does, except we don't need a PartitionPruneState here, so don't call
     * that function.
     *
     * XXX some refactoring might be good.
     */

+1, while doing it would be nice if foreach_current_index() is used
instead of the i & j sequence in the respective foreach() block, IMO.
--

+                   while ((i = bms_next_member(validsubplans, i)) >= 0)
+                   {
+                       Plan   *subplan = list_nth(subplans, i);
+
+                       context->relations =
+                           bms_add_members(context->relations,
+                                           get_plan_scanrelids(subplan));
+                   }

I think instead of get_plan_scanrelids() the
GetLockableRelations_worker() can be used; if so, then no need to add
get_plan_scanrelids() function.
--

     /* Nodes containing prunable subnodes. */
+       case T_MergeAppend:
+           {
+               PlannedStmt *plannedstmt = context->plannedstmt;
+               List       *rtable = plannedstmt->rtable;
+               ParamListInfo params = context->params;
+               PartitionPruneInfo *pruneinfo;
+               Bitmapset  *validsubplans;
+               Bitmapset  *parentrelids;

...
                if (pruneinfo && pruneinfo->contains_init_steps)
                {
                    int     i;
...
                   return false;
                }
            }
            break;

Most of the declarations need to be moved inside the if-block.

Also, initially, I was a bit concerned regarding this code block
inside GetLockableRelations_worker(), what if (pruneinfo &&
pruneinfo->contains_init_steps) evaluated to false? After debugging I
realized that plan_tree_walker() will do the needful -- a bit of
comment would have helped.
--

+       case T_CustomScan:
+           foreach(lc, ((CustomScan *) plan)->custom_plans)
+           {
+               if (walker((Plan *) lfirst(lc), context))
+                   return true;
+           }
+           break;

Why not plan_walk_members() call like other nodes?

Regards,
Amul



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Fri, Dec 24, 2021 at 10:36 PM Amit Langote <amitlangote09@gmail.com> wrote:
> However, using an idea that Robert suggested to me off-list a little
> while back, it seems possible to determine the set of partitions that
> we can safely skip locking.  The idea is to look at the "initial" or
> "pre-execution" pruning instructions contained in a given Append or
> MergeAppend node when AcquireExecutorLocks() is collecting the
> relations to lock and consider relations from only those sub-nodes
> that survive performing those instructions.   I've attempted
> implementing that idea in the attached patch.

Hmm. The first question that occurs to me is whether this is fully safe.

Currently, AcquireExecutorLocks calls LockRelationOid for every
relation involved in the query. That means we will probably lock at
least one relation on which we previously had no lock and thus
AcceptInvalidationMessages(). That will end up marking the query as no
longer valid and CheckCachedPlan() will realize this and tell the
caller to replan. In the corner case where we already hold all the
required locks, we will not accept invalidation messages at this
point, but must have done so after acquiring the last of the locks
required, and if that didn't mark the plan invalid, it can't be
invalid now either. Either way, everything is fine.

With the proposed patch, we might never lock some of the relations
involved in the query. Therefore, if one of those relations has been
modified in some way that would invalidate the plan, we will
potentially fail to discover this, and will use the plan anyway. For
instance, suppose there's one particular partition that has an extra
index and the plan involves an Index Scan using that index. Now
suppose that the scan of the partition in question is pruned, but
meanwhile, the index has been dropped. Now we're running a plan that
scans a nonexistent index. Admittedly, we're not running that part of
the plan. But is that enough for this to be safe? There are things
(like EXPLAIN or auto_explain) that we might try to do even on a part
of the plan tree that we don't try to run. Those things might break,
because for example we won't be able to look up the name of an index
in the catalogs for EXPLAIN output if the index is gone.

This is just a relatively simple example and I think there are
probably a bunch of others. There are a lot of kinds of DDL that could
be performed on a partition that gets pruned away: DROP INDEX is just
one example. The point is that to my knowledge we have no existing
case where we try to use a plan that might be only partly valid, so if
we introduce one, there's some risk there. I thought for a while, too,
about whether changes to some object in a part of the plan that we're
not executing could break things for the rest of the plan even if we
never do anything with the plan but execute it. I can't quite see any
actual hazard. For example, I thought about whether we might try to
get the tuple descriptor for the pruned-away object and get a
different tuple descriptor than we were expecting. I think we can't,
because (1) the pruned object has to be a partition, and tuple
descriptors have to match throughout the partitioning hierarchy,
except for column ordering, which currently can't be changed
after-the-fact and (2) IIRC, the tuple descriptor is stored in the
plan and not reconstructed at runtime and (3) if we don't end up
opening the relation because it's pruned, then we certainly can't do
anything with its tuple descriptor. But it might be worth giving more
thought to the question of whether there's any other way we could be
depending on the details of an object that ended up getting pruned.

> Note that "initial" pruning steps are now performed twice when
> executing generic plans: once in AcquireExecutorLocks() to find
> partitions to be locked, and a 2nd time in ExecInit[Merge]Append() to
> determine the set of partition sub-nodes to be initialized for
> execution, though I wasn't able to come up with a good idea to avoid
> this duplication.

I think this is something that will need to be fixed somehow. Apart
from the CPU cost, it's scary to imagine that the set of nodes on
which we acquired locks might be different from the set of nodes that
we initialize. If we do the same computation twice, there must be some
non-zero probability of getting a different answer the second time,
even if the circumstances under which it would actually happen are
remote. Consider, for example, a function that is labeled IMMUTABLE
but is really VOLATILE. Now maybe you can get the system to lock one
set of partitions and then initialize a different set of partitions. I
don't think we want to try to reason about what consequences that
might have and prove that somehow it's going to be OK; I think we want
to nail the door shut very tightly to make sure that it can't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
Thanks for taking the time to look at this.

On Wed, Jan 12, 2022 at 1:22 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 24, 2021 at 10:36 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > However, using an idea that Robert suggested to me off-list a little
> > while back, it seems possible to determine the set of partitions that
> > we can safely skip locking.  The idea is to look at the "initial" or
> > "pre-execution" pruning instructions contained in a given Append or
> > MergeAppend node when AcquireExecutorLocks() is collecting the
> > relations to lock and consider relations from only those sub-nodes
> > that survive performing those instructions.   I've attempted
> > implementing that idea in the attached patch.
>
> Hmm. The first question that occurs to me is whether this is fully safe.
>
> Currently, AcquireExecutorLocks calls LockRelationOid for every
> relation involved in the query. That means we will probably lock at
> least one relation on which we previously had no lock and thus
> AcceptInvalidationMessages(). That will end up marking the query as no
> longer valid and CheckCachedPlan() will realize this and tell the
> caller to replan. In the corner case where we already hold all the
> required locks, we will not accept invalidation messages at this
> point, but must have done so after acquiring the last of the locks
> required, and if that didn't mark the plan invalid, it can't be
> invalid now either. Either way, everything is fine.
>
> With the proposed patch, we might never lock some of the relations
> involved in the query. Therefore, if one of those relations has been
> modified in some way that would invalidate the plan, we will
> potentially fail to discover this, and will use the plan anyway.  For
> instance, suppose there's one particular partition that has an extra
> index and the plan involves an Index Scan using that index. Now
> suppose that the scan of the partition in question is pruned, but
> meanwhile, the index has been dropped. Now we're running a plan that
> scans a nonexistent index. Admittedly, we're not running that part of
> the plan. But is that enough for this to be safe? There are things
> (like EXPLAIN or auto_explain) that we might try to do even on a part
> of the plan tree that we don't try to run. Those things might break,
> because for example we won't be able to look up the name of an index
> in the catalogs for EXPLAIN output if the index is gone.
>
> This is just a relatively simple example and I think there are
> probably a bunch of others. There are a lot of kinds of DDL that could
> be performed on a partition that gets pruned away: DROP INDEX is just
> one example. The point is that to my knowledge we have no existing
> case where we try to use a plan that might be only partly valid, so if
> we introduce one, there's some risk there. I thought for a while, too,
> about whether changes to some object in a part of the plan that we're
> not executing could break things for the rest of the plan even if we
> never do anything with the plan but execute it. I can't quite see any
> actual hazard. For example, I thought about whether we might try to
> get the tuple descriptor for the pruned-away object and get a
> different tuple descriptor than we were expecting. I think we can't,
> because (1) the pruned object has to be a partition, and tuple
> descriptors have to match throughout the partitioning hierarchy,
> except for column ordering, which currently can't be changed
> after-the-fact and (2) IIRC, the tuple descriptor is stored in the
> plan and not reconstructed at runtime and (3) if we don't end up
> opening the relation because it's pruned, then we certainly can't do
> anything with its tuple descriptor. But it might be worth giving more
> thought to the question of whether there's any other way we could be
> depending on the details of an object that ended up getting pruned.

I have pondered on the possible hazards before writing the patch,
mainly because the concerns about a previously discussed proposal were
along similar lines [1].

IIUC, you're saying the plan tree is subject to inspection by non-core
code before ExecutorStart() has initialized a PlanState tree, which
must have discarded pruned portions of the plan tree.  I wouldn't
claim to have scanned *all* of the core code that could possibly
access the invalidated portions of the plan tree, but from what I have
seen, I couldn't find any site that does.  An ExecutorStart_hook()
gets to do that, but from what I can see it is expected to call
standard_ExecutorStart() before doing its thing and supposedly only
looks at the PlanState tree, which must be valid.  Actually, EXPLAIN
also does ExecutorStart() before starting to look at the plan (the
PlanState tree), so must not run into pruned plan tree nodes.  All
that said, it does sound like wishful thinking to say that no problems
can possibly occur.

At first, I had tried to implement this such that the
Append/MergeAppend nodes are edited to record the result of initial
pruning, but it felt wrong to be munging the plan tree in plancache.c.

Or, maybe this won't be a concern if performing ExecutorStart() is
made a part of CheckCachedPlan() somehow, which would then take locks
on the relation as the PlanState tree is built capturing any plan
invalidations, instead of AcquireExecutorLocks(). That does sound like
an ambitious undertaking though.

> > Note that "initial" pruning steps are now performed twice when
> > executing generic plans: once in AcquireExecutorLocks() to find
> > partitions to be locked, and a 2nd time in ExecInit[Merge]Append() to
> > determine the set of partition sub-nodes to be initialized for
> > execution, though I wasn't able to come up with a good idea to avoid
> > this duplication.
>
> I think this is something that will need to be fixed somehow. Apart
> from the CPU cost, it's scary to imagine that the set of nodes on
> which we acquired locks might be different from the set of nodes that
> we initialize. If we do the same computation twice, there must be some
> non-zero probability of getting a different answer the second time,
> even if the circumstances under which it would actually happen are
> remote. Consider, for example, a function that is labeled IMMUTABLE
> but is really VOLATILE. Now maybe you can get the system to lock one
> set of partitions and then initialize a different set of partitions. I
> don't think we want to try to reason about what consequences that
> might have and prove that somehow it's going to be OK; I think we want
> to nail the door shut very tightly to make sure that it can't.

Yeah, the premise of the patch is that "initial" pruning steps produce
the same result both times.  I assumed that would be true because the
pruning steps are not allowed to contain any VOLATILE expressions.
Regarding the possibility that IMMUTABLE labeling of functions may be
incorrect, I haven't considered if the runtime pruning code can cope
or whether it should try to.  If such a case does occur in practice,
the bad outcome would be an Assert failure in
ExecGetRangeTableRelation() or using a partition unlocked in the
non-assert builds, the latter of which feels especially bad.

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CA%2BTgmoZN-80143F8OhN8Cn5-uDae5miLYVwMapAuc%2B7%2BZ7pyNg%40mail.gmail.com



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Wed, Jan 12, 2022 at 9:32 AM Amit Langote <amitlangote09@gmail.com> wrote:
> I have pondered on the possible hazards before writing the patch,
> mainly because the concerns about a previously discussed proposal were
> along similar lines [1].

True. I think that the hazards are narrower with this proposal,
because if you *delay* locking a partition that you eventually need,
then you might end up trying to actually execute a portion of the plan
that's no longer valid. That seems like hopelessly bad news. On the
other hand, with this proposal, you skip locking altogether, but only
for parts of the plan that you don't plan to execute. That's still
kind of scary, but not to nearly the same degree.

> IIUC, you're saying the plan tree is subject to inspection by non-core
> code before ExecutorStart() has initialized a PlanState tree, which
> must have discarded pruned portions of the plan tree.  I wouldn't
> claim to have scanned *all* of the core code that could possibly
> access the invalidated portions of the plan tree, but from what I have
> seen, I couldn't find any site that does.  An ExecutorStart_hook()
> gets to do that, but from what I can see it is expected to call
> standard_ExecutorStart() before doing its thing and supposedly only
> looks at the PlanState tree, which must be valid.  Actually, EXPLAIN
> also does ExecutorStart() before starting to look at the plan (the
> PlanState tree), so must not run into pruned plan tree nodes.  All
> that said, it does sound like wishful thinking to say that no problems
> can possibly occur.

Yeah. I don't think it's only non-core code we need to worry about
either. What if I just do EXPLAIN ANALYZE on a prepared query that
ends up pruning away some stuff? IIRC, the pruned subplans are not
shown, so we might escape disaster here, but FWIW if I'd committed
that code I would have pushed hard for showing those and saying "(not
executed)" .... so it's not too crazy to imagine a world in which
things work that way.

> At first, I had tried to implement this such that the
> Append/MergeAppend nodes are edited to record the result of initial
> pruning, but it felt wrong to be munging the plan tree in plancache.c.

It is. You can't munge the plan tree: it's required to be strictly
read-only once generated. It can be serialized and deserialized for
transmission to workers, and it can be shared across executions.

> Or, maybe this won't be a concern if performing ExecutorStart() is
> made a part of CheckCachedPlan() somehow, which would then take locks
> on the relation as the PlanState tree is built capturing any plan
> invalidations, instead of AcquireExecutorLocks(). That does sound like
> an ambitious undertaking though.

On the surface that would seem to involve abstraction violations, but
maybe that could be finessed somehow. The plancache shouldn't know too
much about what the executor is going to do with the plan, but it
could ask the executor to perform a step that has been designed for
use by the plancache. I guess the core problem here is how to pass
around information that is node-specific before we've stood up the
executor state tree. Maybe the executor could have a function that
does the pruning and returns some kind of array of results that can be
used both to decide what to lock and also what to consider as pruned
at the start of execution. (I'm hand-waving about the details because
I don't know.)

> Yeah, the premise of the patch is that "initial" pruning steps produce
> the same result both times.  I assumed that would be true because the
> pruning steps are not allowed to contain any VOLATILE expressions.
> Regarding the possibility that IMMUTABLE labeling of functions may be
> incorrect, I haven't considered if the runtime pruning code can cope
> or whether it should try to.  If such a case does occur in practice,
> the bad outcome would be an Assert failure in
> ExecGetRangeTableRelation() or using a partition unlocked in the
> non-assert builds, the latter of which feels especially bad.

Right. I think it's OK for a query to produce wrong answers under
those kinds of conditions - the user has broken everything and gets to
keep all the pieces - but doing stuff that might violate fundamental
assumptions of the system like "relations can only be accessed when
holding a lock on them" feels quite bad. It's not a stretch to imagine
that failing to follow those invariants could take the whole system
down, which is clearly too severe a consequence for the user's failure
to label things properly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Jan 6, 2022 at 3:45 PM Amul Sul <sulamul@gmail.com> wrote:
> Here are few comments for v1 patch:

Thanks Amul.  I'm thinking about Robert's latest comments, addressing
which may need some rethinking of this whole design, but I decided to
post a v2 taking care of your comments.

> +   /* Caller error if we get here without contains_init_steps */
> +   Assert(pruneinfo->contains_init_steps);
>
> -       prunedata = prunestate->partprunedata[i];
> -       pprune = &prunedata->partrelprunedata[0];
>
> -       /* Perform pruning without using PARAM_EXEC Params */
> -       find_matching_subplans_recurse(prunedata, pprune, true, &result);
> +   if (parentrelids)
> +       *parentrelids = NULL;
>
> You got two blank lines after Assert.

Fixed.

> --
>
> +   /* Set up EState if not in the executor proper. */
> +   if (estate == NULL)
> +   {
> +       estate = CreateExecutorState();
> +       estate->es_param_list_info = params;
> +       free_estate = true;
>     }
>
> ... [Skip]
>
> +   if (free_estate)
> +   {
> +       FreeExecutorState(estate);
> +       estate = NULL;
>     }
>
> I think this work should be left to the caller.

Done.  Also see below...

>     /*
>      * Stuff that follows matches exactly what ExecCreatePartitionPruneState()
>      * does, except we don't need a PartitionPruneState here, so don't call
>      * that function.
>      *
>      * XXX some refactoring might be good.
>      */
>
> +1, while doing it would be nice if foreach_current_index() is used
> instead of the i & j sequence in the respective foreach() block, IMO.

Actually, I rewrote this part quite significantly so that most of the
code remains in its existing place.  I decided to let
GetLockableRelations_walker() create a PartitionPruneState and pass
that to ExecFindInitialMatchingSubPlans() that is now left more or
less as is.  Instead, ExecCreatePartitionPruneState() is changed to be
callable from outside the executor.

The temporary EState is no longer necessary.  ExprContext,
PartitionDirectory, etc. are now managed in the caller,
GetLockableRelations_walker().

> --
>
> +                   while ((i = bms_next_member(validsubplans, i)) >= 0)
> +                   {
> +                       Plan   *subplan = list_nth(subplans, i);
> +
> +                       context->relations =
> +                           bms_add_members(context->relations,
> +                                           get_plan_scanrelids(subplan));
> +                   }
>
> I think instead of get_plan_scanrelids() the
> GetLockableRelations_worker() can be used; if so, then no need to add
> get_plan_scanrelids() function.

You're right, done.

> --
>
>      /* Nodes containing prunable subnodes. */
> +       case T_MergeAppend:
> +           {
> +               PlannedStmt *plannedstmt = context->plannedstmt;
> +               List       *rtable = plannedstmt->rtable;
> +               ParamListInfo params = context->params;
> +               PartitionPruneInfo *pruneinfo;
> +               Bitmapset  *validsubplans;
> +               Bitmapset  *parentrelids;
>
> ...
>                 if (pruneinfo && pruneinfo->contains_init_steps)
>                 {
>                     int     i;
> ...
>                    return false;
>                 }
>             }
>             break;
>
> Most of the declarations need to be moved inside the if-block.

Done.

> Also, initially, I was a bit concerned regarding this code block
> inside GetLockableRelations_worker(), what if (pruneinfo &&
> pruneinfo->contains_init_steps) evaluated to false? After debugging I
> realized that plan_tree_walker() will do the needful -- a bit of
> comment would have helped.

You're right.  Added a dummy else {} block with just the comment saying so.

> +       case T_CustomScan:
> +           foreach(lc, ((CustomScan *) plan)->custom_plans)
> +           {
> +               if (walker((Plan *) lfirst(lc), context))
> +                   return true;
> +           }
> +           break;
>
> Why not plan_walk_members() call like other nodes?

Makes sense, done.

Again, most/all of this patch might need to be thrown away, but here
it is anyway.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Jan 14, 2022 at 11:10 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Thu, Jan 6, 2022 at 3:45 PM Amul Sul <sulamul@gmail.com> wrote:
> > Here are few comments for v1 patch:
>
> Thanks Amul.  I'm thinking about Robert's latest comments, addressing
> which may need some rethinking of this whole design, but I decided to
> post a v2 taking care of your comments.

cfbot tells me there is an unused variable warning, which is fixed in
the attached v3.


--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Simon Riggs
Date:
On Tue, 11 Jan 2022 at 16:22, Robert Haas <robertmhaas@gmail.com> wrote:

> This is just a relatively simple example and I think there are
> probably a bunch of others. There are a lot of kinds of DDL that could
> be performed on a partition that gets pruned away: DROP INDEX is just
> one example.

I haven't followed this in any detail, but this patch and its goal of
reducing the O(N) drag effect on partition execution time is very
important. Locking a long list of objects that then get pruned is very
wasteful, as the results show.

Ideally, we want an O(1) algorithm for single partition access and DDL
is rare. So perhaps that is the starting point for a safe design -
invent a single lock or cache that allows us to check if the partition
hierarchy has changed in any way, and if so, replan, if not, skip
locks.

Please excuse me if this idea falls short, if so, please just note my
comment about how important this is. Thanks.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
Hi Simon,

On Tue, Jan 18, 2022 at 4:44 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> On Tue, 11 Jan 2022 at 16:22, Robert Haas <robertmhaas@gmail.com> wrote:
> > This is just a relatively simple example and I think there are
> > probably a bunch of others. There are a lot of kinds of DDL that could
> > be performed on a partition that gets pruned away: DROP INDEX is just
> > one example.
>
> I haven't followed this in any detail, but this patch and its goal of
> reducing the O(N) drag effect on partition execution time is very
> important. Locking a long list of objects that then get pruned is very
> wasteful, as the results show.
>
> Ideally, we want an O(1) algorithm for single partition access and DDL
> is rare. So perhaps that is the starting point for a safe design -
> invent a single lock or cache that allows us to check if the partition
> hierarchy has changed in any way, and if so, replan, if not, skip
> locks.

Rearchitecting partition locking to be O(1) seems like a project of
non-trivial complexity as Robert mentioned in a related email thread
couple of years ago:

https://www.postgresql.org/message-id/CA%2BTgmoYbtm1uuDne3rRp_uNA2RFiBwXX1ngj3RSLxOfc3oS7cQ%40mail.gmail.com

Pursuing that kind of a project would perhaps have been more
worthwhile if the locking issue had affected more than just this
particular case, that is, the case of running prepared statements over
partitioned tables using generic plans.  Addressing this by
rearchitecting run-time pruning (and plancache to some degree) seemed
like it might lead to this getting fixed in a bounded timeframe.  I
admit that the concerns that Robert has raised about the patch make me
want to reconsider that position, though maybe it's too soon to
conclude.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Simon Riggs
Date:
On Tue, 18 Jan 2022 at 08:10, Amit Langote <amitlangote09@gmail.com> wrote:
>
> Hi Simon,
>
> On Tue, Jan 18, 2022 at 4:44 PM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> > On Tue, 11 Jan 2022 at 16:22, Robert Haas <robertmhaas@gmail.com> wrote:
> > > This is just a relatively simple example and I think there are
> > > probably a bunch of others. There are a lot of kinds of DDL that could
> > > be performed on a partition that gets pruned away: DROP INDEX is just
> > > one example.
> >
> > I haven't followed this in any detail, but this patch and its goal of
> > reducing the O(N) drag effect on partition execution time is very
> > important. Locking a long list of objects that then get pruned is very
> > wasteful, as the results show.
> >
> > Ideally, we want an O(1) algorithm for single partition access and DDL
> > is rare. So perhaps that is the starting point for a safe design -
> > invent a single lock or cache that allows us to check if the partition
> > hierarchy has changed in any way, and if so, replan, if not, skip
> > locks.
>
> Rearchitecting partition locking to be O(1) seems like a project of
> non-trivial complexity as Robert mentioned in a related email thread
> couple of years ago:
>
> https://www.postgresql.org/message-id/CA%2BTgmoYbtm1uuDne3rRp_uNA2RFiBwXX1ngj3RSLxOfc3oS7cQ%40mail.gmail.com

I agree, completely redesigning locking is a major project. But that
isn't what I suggested, which was to find an O(1) algorithm to solve
the safety issue. I'm sure there is an easy way to check one lock,
maybe a new one/new kind, rather than N.

Why does the safety issue exist? Why is it important to be able to
concurrently access parts of the hierarchy with DDL? Those are not
critical points.

If we asked them, most users would trade a 10x performance gain for
some restrictions on DDL. If anyone cares, make it an option, but most
people will use it.

Maybe force all DDL, or just DDL that would cause safety issues, to
update a hierarchy version number, so queries can tell whether they
need to replan. Don't know, just looking for an O(1) solution.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Tue, Jan 18, 2022 at 3:10 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Pursuing that kind of a project would perhaps have been more
> worthwhile if the locking issue had affected more than just this
> particular case, that is, the case of running prepared statements over
> partitioned tables using generic plans.  Addressing this by
> rearchitecting run-time pruning (and plancache to some degree) seemed
> like it might lead to this getting fixed in a bounded timeframe.  I
> admit that the concerns that Robert has raised about the patch make me
> want to reconsider that position, though maybe it's too soon to
> conclude.

I wasn't trying to say that your approach was dead in the water. It
does create a situation that can't happen today, and such things are
scary and need careful thought. But redesigning the locking mechanism
would need careful thought, too ... maybe even more of it than sorting
this out.

I do also agree with Simon that this is an important problem to which
we need to find some solution.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Jan 18, 2022 at 7:28 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> On Tue, 18 Jan 2022 at 08:10, Amit Langote <amitlangote09@gmail.com> wrote:
> > On Tue, Jan 18, 2022 at 4:44 PM Simon Riggs
> > <simon.riggs@enterprisedb.com> wrote:
> > > I haven't followed this in any detail, but this patch and its goal of
> > > reducing the O(N) drag effect on partition execution time is very
> > > important. Locking a long list of objects that then get pruned is very
> > > wasteful, as the results show.
> > >
> > > Ideally, we want an O(1) algorithm for single partition access and DDL
> > > is rare. So perhaps that is the starting point for a safe design -
> > > invent a single lock or cache that allows us to check if the partition
> > > hierarchy has changed in any way, and if so, replan, if not, skip
> > > locks.
> >
> > Rearchitecting partition locking to be O(1) seems like a project of
> > non-trivial complexity as Robert mentioned in a related email thread
> > couple of years ago:
> >
> > https://www.postgresql.org/message-id/CA%2BTgmoYbtm1uuDne3rRp_uNA2RFiBwXX1ngj3RSLxOfc3oS7cQ%40mail.gmail.com
>
> I agree, completely redesigning locking is a major project. But that
> isn't what I suggested, which was to find an O(1) algorithm to solve
> the safety issue. I'm sure there is an easy way to check one lock,
> maybe a new one/new kind, rather than N.

I misread your email then, sorry.

> Why does the safety issue exist? Why is it important to be able to
> concurrently access parts of the hierarchy with DDL? Those are not
> critical points.
>
> If we asked them, most users would trade a 10x performance gain for
> some restrictions on DDL. If anyone cares, make it an option, but most
> people will use it.
>
> Maybe force all DDL, or just DDL that would cause safety issues, to
> update a hierarchy version number, so queries can tell whether they
> need to replan. Don't know, just looking for an O(1) solution.

Yeah, it would be great if it would suffice to take a single lock on
the partitioned table mentioned in the query, rather than on all
elements of the partition tree added to the plan.  AFAICS, ways to get
that are 1) Prevent modifying non-root partition tree elements, 2)
Make it so that locking a partitioned table becomes a proxy for having
locked all of its descendents, 3) Invent a Plan representation for
scanning partitioned tables such that adding the descendent tables
that survive plan-time pruning to the plan doesn't require locking
them too.  IIUC, you've mentioned 1 and 2.  I think I've seen 3
mentioned in the past discussions on this topic, but I guess the
research on whether that's doable has never been done.


--
Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Jan 18, 2022 at 11:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 18, 2022 at 3:10 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Pursuing that kind of a project would perhaps have been more
> > worthwhile if the locking issue had affected more than just this
> > particular case, that is, the case of running prepared statements over
> > partitioned tables using generic plans.  Addressing this by
> > rearchitecting run-time pruning (and plancache to some degree) seemed
> > like it might lead to this getting fixed in a bounded timeframe.  I
> > admit that the concerns that Robert has raised about the patch make me
> > want to reconsider that position, though maybe it's too soon to
> > conclude.
>
> I wasn't trying to say that your approach was dead in the water. It
> does create a situation that can't happen today, and such things are
> scary and need careful thought. But redesigning the locking mechanism
> would need careful thought, too ... maybe even more of it than sorting
> this out.

Yes, agreed.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Simon Riggs
Date:
On Wed, 19 Jan 2022 at 08:31, Amit Langote <amitlangote09@gmail.com> wrote:

> > Maybe force all DDL, or just DDL that would cause safety issues, to
> > update a hierarchy version number, so queries can tell whether they
> > need to replan. Don't know, just looking for an O(1) solution.
>
> Yeah, it would be great if it would suffice to take a single lock on
> the partitioned table mentioned in the query, rather than on all
> elements of the partition tree added to the plan.  AFAICS, ways to get
> that are 1) Prevent modifying non-root partition tree elements,

Can we reuse the concept of Strong/Weak locking here?

When a DDL request is in progress (for that partitioned table), take
all required locks for safety. When a DDL request is not in progress,
take minimal locks knowing it is safe.

We can take a single PartitionTreeModificationLock, nowait to prove
that we do not need all locks. DDL would request the lock in exclusive
mode. (Other mechanisms possible).

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Jan 13, 2022 at 3:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 12, 2022 at 9:32 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Or, maybe this won't be a concern if performing ExecutorStart() is
> > made a part of CheckCachedPlan() somehow, which would then take locks
> > on the relation as the PlanState tree is built capturing any plan
> > invalidations, instead of AcquireExecutorLocks(). That does sound like
> > an ambitious undertaking though.
>
> On the surface that would seem to involve abstraction violations, but
> maybe that could be finessed somehow. The plancache shouldn't know too
> much about what the executor is going to do with the plan, but it
> could ask the executor to perform a step that has been designed for
> use by the plancache. I guess the core problem here is how to pass
> around information that is node-specific before we've stood up the
> executor state tree. Maybe the executor could have a function that
> does the pruning and returns some kind of array of results that can be
> used both to decide what to lock and also what to consider as pruned
> at the start of execution. (I'm hand-waving about the details because
> I don't know.)

The attached patch implements this idea.  Sorry for the delay in
getting this out and thanks to Robert for the off-list discussions on
this.

So the new executor "step" you mention is the function ExecutorPrep in
the patch, which calls a recursive function ExecPrepNode on the plan
tree's top node, much as ExecutorStart calls (via InitPlan)
ExecInitNode to construct a PlanState tree for actual execution
paralleling the plan tree.

For now, ExecutorPrep() / ExecPrepNode() does mainly two things if and
as it walks the plan tree: 1) Extract the RT indexes of RTE_RELATION
entries and add them to a bitmapset in the result struct, 2) If the
node contains a PartitionPruneInfo, perform its "initial pruning
steps" and store the result of doing so in a per-plan-node node called
PlanPrepOutput.  The bitmapset and the array containing per-plan-node
PlanPrepOutput nodes are returned in a node called ExecPrepOutput,
which is the result of ExecutorPrep, to its calling module (say,
plancache.c), which, after it's done using that information, must pass
it forward to subsequent execution steps.  That is done by passing it,
via the module's callers, to CreateQueryDesc() which remembers the
ExecPrepOutput in QueryDesc that is eventually passed to
ExecutorStart().

A bunch of other details are mentioned in the patch's commit message,
which I'm pasting below for anyone reading to spot any obvious flaws
(no-go's) of this approach:

    Invent a new executor "prep" phase

    The new phase, implemented by execMain.c:ExecutorPrep() and its
    recursive underling execProcnode.c:ExecPrepNode(), takes a query's
    PlannedStmt and processes the plan tree contained in it to produce
    an ExecPrepOutput node as the result.

    As the plan tree is walked, each node must add the RT index(es) of
    any relation(s) that it directly manipulates to a bitmapset member of
    ExecPrepOutput (for example, an IndexScan node must add the Scan's
    scanrelid).  Also, each node may want to make a PlanPrepOutput node
    containing additional information that may be of interest to the
    calling module or to the later execution phases, if the node can
    provide one (for example, an Append node may perform initial pruning
    and add a set of "initially valid subplans" to the PlanPrepOutput).
    The PlanPrepOutput nodes of all the plan nodes are added to an array
    in the ExecPrepOutput, which is indexed using the individual nodes'
    plan_node_id; a NULL is stored in the array slots of nodes that
    don't have anything interesting to add to the PlanPrepOutput.

    The ExecPrepOutput thus produced is passed to CreateQueryDesc()
    and subsequently to ExecutorStart() via QueryDesc, which then makes
    it available to the executor routines via the query's EState.

    The main goal of adding this new phase is, for now, to allow cached
    generic plans containing scans of partitioned tables using
    Append/MergeAppend to be executed more efficiently by the prep phase
    doing any initial pruning, instead of deferring that to
    ExecutorStart().  That may allow AcquireExecutorLocks() on the plan
    to lock only the minimal set of relations/partitions, that is
    those whose subplans survive the initial pruning.

    Implementation notes:

    * To allow initial pruning to be done as part of the pre-execution
    prep phase as opposed to as part of ExecutorStart(), this refactors
    ExecCreatePartitionPruneState() and ExecFindInitialMatchingSubPlans()
    to pass the information needed to do initial pruning directly as
    parameters instead of getting that from the EState and the PlanState
    of the parent Append/MergeAppend, both of which would not be
    available in ExecutorPrep().  Another, sort of non-essential-to-this-
    goal, refactoring this does is moving the partition pruning
    initialization stanzas in ExecInitAppend() and ExecInitMergeAppend(),
    both of which contain the same code, into its own function
    ExecInitPartitionPruning().

    * To pass the ExecPrepOutput(s) created by the plancache module's
    invocation of ExecutorPrep() to the callers of the module, which in
    turn would pass them down to ExecutorStart(), CachedPlan gets a new
    List field that stores those ExecPrepOutputs, containing one element
    for each PlannedStmt also contained in the CachedPlan.  The new list
    is stored in a child context of the context containing the
    PlannedStmts, though unlike the latter, it is reset on every
    invocation of CheckCachedPlan(), which in turn calls ExecutorPrep()
    with a new set of bound Params.

    * AcquireExecutorLocks() is now made to loop over a bitmapset of RT
    indexes, those of relations returned in ExecPrepOutput, instead of
    over the whole range table.  With initial pruning that is also done
    as part of ExecutorPrep(), only relations from non-pruned nodes of
    the plan tree would get locked as a result of this new arrangement.

    * PlannedStmt gets a new field usesPrepExecPruning that indicates
    whether any of the nodes of the plan tree contain "initial" (or
    "pre-execution") pruning steps, which saves ExecutorPrep() the
    trouble of walking the plan tree only to find out whether that's
    the case.

    * PartitionPruneInfo nodes now explicitly stores whether the steps
    contained in any of the individual PartitionedRelPruneInfos embedded
    in it contain initial pruning steps (those that can be performed
    during ExecutorPrep) and execution pruning steps (those that can only
    be performed during ExecutorRun), as flags contains_initial_steps and
    contains_exec_steps, respectively.  In fact, the aforementioned
    PlannedStmt field's value is a logical OR of the values of the former
    across all PartitionPruneInfo nodes embedded in the plan tree.

    * PlannedStmt also gets a bitmapset field to store the RT indexes of
    all relation RTEs referenced in the query that is populated when
    constructing the flat range table in setrefs.c, which effectively
    contains all the relations that the planner must have locked. In the
    case of a cached plan, AcquireExecutorLocks() must lock all of those
    relations, except those whose subnodes get pruned as a result of
    ExecutorPrep().

    * PlannedStmt gets yet another field numPlanNodes that records the
    highest plan_node_id assigned to any of the nodes contained in the
    tree, which serves as the size to use when allocating the
    PlanPrepOutput array.
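
In rough terms, the two new node types look something like the sketch
below (field names are paraphrased from the above and may not match
the patch exactly), and AcquireExecutorLocks() then only needs to
visit the RT indexes recorded in the bitmapset:

/* Per-plan-node output of the prep phase; most nodes leave this NULL. */
typedef struct PlanPrepOutput
{
	NodeTag		type;
	Bitmapset  *initially_valid_subnodes;	/* survivors of initial pruning */
} PlanPrepOutput;

/* Result of ExecutorPrep() for one PlannedStmt. */
typedef struct ExecPrepOutput
{
	NodeTag		type;
	Bitmapset  *relations;		/* RT indexes of surviving relations */
	PlanPrepOutput **planprep;	/* indexed by plan_node_id; NULL entries
								 * for nodes with nothing to report */
	int			numPlanNodes;
} ExecPrepOutput;

/*
 * AcquireExecutorLocks() then does, approximately, the following
 * (helper name made up for illustration):
 */
static void
lock_surviving_relations(PlannedStmt *plannedstmt, ExecPrepOutput *execprep)
{
	int			rti = -1;

	while ((rti = bms_next_member(execprep->relations, rti)) >= 0)
	{
		RangeTblEntry *rte = rt_fetch(rti, plannedstmt->rtable);

		if (rte->rtekind == RTE_RELATION)
			LockRelationOid(rte->relid, rte->rellockmode);
	}
}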

Maybe this should be more than one patch?  Say:

0001 to add ExecutorPrep and the boilerplate,
0002 to teach plancache.c to use the new facility

Thoughts?

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Thu, Feb 10, 2022 at 3:14 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Maybe this should be more than one patch?  Say:
>
> 0001 to add ExecutorPrep and the boilerplate,
> 0002 to teach plancache.c to use the new facility

Could be, not sure. I agree that if it's possible to split this in a
meaningful way, it would facilitate review. I notice that there is
some straight code movement e.g. the creation of
ExecPartitionPruneFixSubPlanIndexes. It would be best, I think, to do
pure code movement in a preparatory patch so that the main patch is
just adding the new stuff we need and not moving stuff around.

David Rowley recently proposed a patch for some parallel-safety
debugging cross checks which added a plan tree walker. I'm not sure
whether he's going to press that patch forward to commit, but I think
we should get something like that into the tree and start using it,
rather than adding more bespoke code. Maybe you/we should steal that
part of his patch and commit it separately. What I'm imagining is that
plan_tree_walker() would know which nodes have subnodes and how to
recurse over the tree structure, and you'd have a walker function to
use with it that would know which executor nodes have ExecPrep
functions and call them, and just do nothing for the others. That
would spare you adding stub functions for nodes that don't need to do
anything, or don't need to do anything other than recurse. Admittedly
it would look a bit different from the existing executor phases, but
I'd argue that it's a better coding model.

Actually, you might've had this in the patch at some point, because
you have a declaration for plan_tree_walker but no implementation. I
guess one thing that's a bit awkward about this idea is that in some
cases you want to recurse to some subnodes but not other subnodes. But
maybe it would work to put the recursion in the walker function in
that case, and then just return true; but if you want to walk all
children, return false.
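
Something like the following, perhaps, where ExecPrepAppend() and the
ExecPrepContext type are placeholders and the return convention is the
one just described (true means "I've already dealt with my children",
false means "walk all of them"), unlike expression_tree_walker() where
returning true aborts the whole walk:

typedef struct ExecPrepContext ExecPrepContext;		/* placeholder */
static void ExecPrepAppend(Plan *plan, ExecPrepContext *context);	/* ditto */

static bool
exec_prep_walker(Plan *plan, ExecPrepContext *context)
{
	switch (nodeTag(plan))
	{
		case T_Append:
		case T_MergeAppend:
			/*
			 * Perform initial pruning, record the result, and recurse
			 * only into the surviving subplans inside ExecPrepAppend().
			 */
			ExecPrepAppend(plan, context);
			return true;
		default:
			/* nothing node-specific to do; walk all children normally */
			return false;
	}
}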

+ bool contains_init_steps;
+ bool contains_exec_steps;

s/steps/pruning/? maybe with contains -> needs or performs or requires as well?

+ * Returned information includes the set of RT indexes of relations referenced
+ * in the plan, and a PlanPrepOutput node for each node in the planTree if the
+ * node type supports producing one.

Aren't all RT indexes referenced in the plan?

+ * This may lock relations whose information may be used to produce the
+ * PlanPrepOutput nodes. For example, a partitioned table before perusing its
+ * PartitionPruneInfo contained in an Append node to do the pruning the result
+ * of which is used to populate the Append node's PlanPrepOutput.

"may lock" feels awfully fuzzy to me. How am I supposed to rely on
something that "may" happen? And don't we need to have tight logic
around locking, with specific guarantees about what is locked at which
points in the code and what is not?

+ * At least one of 'planstate' or 'econtext' must be passed to be able to
+ * successfully evaluate any non-Const expressions contained in the
+ * steps.

This also seems fuzzy. If I'm thinking of calling this function, I
don't know how I'd know whether this criterion is met.

I don't love PlanPrepOutput the way you have it. I think one of the
basic design issues for this patch is: should we think of the prep
phase as specifically pruning, or is it general prep and pruning is
the first thing for which we're going to use it? If it's really a
pre-pruning phase, we could name it that way instead of calling it
"prep". If it's really a general prep phase, then why does
PlanPrepOutput contain initially_valid_subnodes as a field? One could
imagine letting each prep function decide what kind of prep node it
would like to return, with partition pruning being just one of the
options. But is that a useful generalization of the basic concept, or
just pretending that a special-purpose mechanism is more general than
it really is?

+ return CreateQueryDesc(pstmt, NULL, /* XXX pass ExecPrepOutput too? */

It seems to me that we should do what the XXX suggests. It doesn't
seem nice if the parallel workers could theoretically decide to prune
a different set of nodes than the leader.

+ * known at executor startup (excludeing expressions containing

Extra e.

+ * into subplan indexes, is also returned for use during subsquent

Missing e.

Somewhere, we're going to need to document the idea that this may
permit us to execute a plan that isn't actually fully valid, but that
we expect to survive because we'll never do anything with the parts of
it that aren't. Maybe that should be added to the executor README, or
maybe there's some better place, but I don't think that should remain
something that's just implicit.

This is not a full review, just some initial thoughts looking through this.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Andres Freund
Date:
Hi,

On 2022-02-10 17:13:52 +0900, Amit Langote wrote:
> The attached patch implements this idea.  Sorry for the delay in
> getting this out and thanks to Robert for the off-list discussions on
> this.

I did not follow this thread at all. And I only skimmed the patch. So I'm
probably wrong.

I'm wary of this increasing executor overhead even in cases it won't
help. Without this patch, for simple queries, I see small allocations
noticeably in profiles. This adds a bunch more, even if
!context->stmt->usesPreExecPruning:

- makeNode(ExecPrepContext)
- makeNode(ExecPrepOutput)
- palloc0(sizeof(PlanPrepOutput *) * result->numPlanNodes)
- stmt_execprep_list = lappend(stmt_execprep_list, execprep);
- AllocSetContextCreate(CurrentMemoryContext,
  "CachedPlan execprep list", ...
- ...

That's a lot of extra for something that's already a bottleneck.

Greetings,

Andres Freund



Re: generic plans and "initial" pruning

From
David Rowley
Date:
(just catching up on this thread)

On Thu, 13 Jan 2022 at 07:20, Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah. I don't think it's only non-core code we need to worry about
> either. What if I just do EXPLAIN ANALYZE on a prepared query that
> ends up pruning away some stuff? IIRC, the pruned subplans are not
> shown, so we might escape disaster here, but FWIW if I'd committed
> that code I would have pushed hard for showing those and saying "(not
> executed)" .... so it's not too crazy to imagine a world in which
> things work that way.

FWIW, that would remove the whole point in init run-time pruning.  The
reason I made two phases of run-time pruning was so that we could get
away from having the init plan overhead of nodes we'll never need to
scan.  If we wanted to show the (never executed) scans in EXPLAIN then
we'd need to do the init plan part and allocate all that memory
needlessly.

Imagine a hash partitioned table on "id" with 1000 partitions. The user does:

PREPARE q1 (INT) AS SELECT * FROM parttab WHERE id = $1;

EXECUTE q1(123);

Assuming a generic plan, if we didn't have init pruning then we have
to build a plan containing the scans for all 1000 partitions. There's
significant overhead to that compared to just locking the partitions,
and initialising 1 scan.

If it worked this way then we'd be even further from Amit's goal of
reducing the overhead of starting a plan with run-time pruning nodes.

I understood at the time it was just the EXPLAIN output that you had
concerns with. I thought that was just around the lack of any display
of the condition we used for pruning.

David



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Sun, Feb 13, 2022 at 4:55 PM David Rowley <dgrowleyml@gmail.com> wrote:
> FWIW, that would remove the whole point in init run-time pruning.  The
> reason I made two phases of run-time pruning was so that we could get
> away from having the init plan overhead of nodes we'll never need to
> scan.  If we wanted to show the (never executed) scans in EXPLAIN then
> we'd need to do the init plan part and allocate all that memory
> needlessly.

Interesting. I didn't realize that was why it had ended up like this.

> I understood at the time it was just the EXPLAIN output that you had
> concerns with. I thought that was just around the lack of any display
> of the condition we used for pruning.

That was part of it, but I did think it was surprising that we didn't
print anything at all about the nodes we pruned, too. Although we're
technically iterating over the PlanState, from the user perspective it
feels like you're asking PostgreSQL to print out the plan - so it
seems weird to have nodes in the Plan tree that are quietly omitted
from the output. That said, perhaps in retrospect it's good that it
ended up as it did, since we'd have a lot of trouble printing anything
sensible for a scan of a table that's since been dropped.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
Hi Andres,

On Fri, Feb 11, 2022 at 10:29 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-02-10 17:13:52 +0900, Amit Langote wrote:
> > The attached patch implements this idea.  Sorry for the delay in
> > getting this out and thanks to Robert for the off-list discussions on
> > this.
>
> I did not follow this thread at all. And I only skimmed the patch. So I'm
> probably wrong.

Thanks for your interest in this and sorry about the delay in replying
(have been away due to illness).

> I'm wary of this increasing executor overhead even in cases it won't
> help. Without this patch, for simple queries, I see small allocations
> noticeably in profiles. This adds a bunch more, even if
> !context->stmt->usesPreExecPruning:

Ah, if any new stuff added by the patch runs in
!context->stmt->usesPreExecPruning paths, then it's just poor coding
on my part, which I'm now looking to fix.  Maybe not all of it is
avoidable, but I think whatever isn't should be trivial...

> - makeNode(ExecPrepContext)
> - makeNode(ExecPrepOutput)
> - palloc0(sizeof(PlanPrepOutput *) * result->numPlanNodes)
> - stmt_execprep_list = lappend(stmt_execprep_list, execprep);
> - AllocSetContextCreate(CurrentMemoryContext,
>   "CachedPlan execprep list", ...
> - ...
>
> That's a lot of extra for something that's already a bottleneck.

If all these allocations are limited to the usesPreExecPruning path,
IMO, they would amount to trivial overhead compared to what is going
to be avoided -- locking say 1000 partitions when only 1 will be
scanned.  Although, maybe there's a way to code this to have even less
overhead than what's in the patch now.
--
Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Feb 11, 2022 at 7:02 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 10, 2022 at 3:14 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Maybe this should be more than one patch?  Say:
> >
> > 0001 to add ExecutorPrep and the boilerplate,
> > 0002 to teach plancache.c to use the new facility

Thanks for taking a look and sorry about the delay.

> Could be, not sure. I agree that if it's possible to split this in a
> meaningful way, it would facilitate review. I notice that there is
> some straight code movement e.g. the creation of
> ExecPartitionPruneFixSubPlanIndexes. It would be best, I think, to do
> pure code movement in a preparatory patch so that the main patch is
> just adding the new stuff we need and not moving stuff around.

Okay, created 0001 for moving around the execution pruning code.

> David Rowley recently proposed a patch for some parallel-safety
> debugging cross checks which added a plan tree walker. I'm not sure
> whether he's going to press that patch forward to commit, but I think
> we should get something like that into the tree and start using it,
> rather than adding more bespoke code. Maybe you/we should steal that
> part of his patch and commit it separately.

I looked at the thread you mentioned (I guess [1]), though it seems
David's proposing a path_tree_walker(), so I guess only useful within
the planner and not here.

> What I'm imagining is that
> plan_tree_walker() would know which nodes have subnodes and how to
> recurse over the tree structure, and you'd have a walker function to
> use with it that would know which executor nodes have ExecPrep
> functions and call them, and just do nothing for the others. That
> would spare you adding stub functions for nodes that don't need to do
> anything, or don't need to do anything other than recurse. Admittedly
> it would look a bit different from the existing executor phases, but
> I'd argue that it's a better coding model.
>
> Actually, you might've had this in the patch at some point, because
> you have a declaration for plan_tree_walker but no implementation.

Right, the previous patch indeed used a plan_tree_walker() for this,
and I think in the way you seem to think it should work.

I do agree that plan_tree_walker() allows for a better implementation
of the idea of this patch and may also be generally useful, so I've
created a separate patch that adds it to nodeFuncs.c.

> I guess one thing that's a bit awkward about this idea is that in some
> cases you want to recurse to some subnodes but not other subnodes. But
> maybe it would work to put the recursion in the walker function in
> that case, and then just return true; but if you want to walk all
> children, return false.

Right, that's how I've made ExecPrepAppend() etc. do it.

> + bool contains_init_steps;
> + bool contains_exec_steps;
>
> s/steps/pruning/? maybe with contains -> needs or performs or requires as well?

Went with: needs_{init|exec}_pruning

> + * Returned information includes the set of RT indexes of relations referenced
> + * in the plan, and a PlanPrepOutput node for each node in the planTree if the
> + * node type supports producing one.
>
> Aren't all RT indexes referenced in the plan?

Ah yes.  How about:

 * Returned information includes the set of RT indexes of relations that must
 * be locked to safely execute the plan,

> + * This may lock relations whose information may be used to produce the
> + * PlanPrepOutput nodes. For example, a partitioned table before perusing its
> + * PartitionPruneInfo contained in an Append node to do the pruning the result
> + * of which is used to populate the Append node's PlanPrepOutput.
>
> "may lock" feels awfully fuzzy to me. How am I supposed to rely on
> something that "may" happen? And don't we need to have tight logic
> around locking, with specific guarantees about what is locked at which
> points in the code and what is not?

Agreed, the wording was fuzzy.  I've rewritten it as:

 * This locks relations whose information is needed to produce the
 * PlanPrepOutput nodes. For example, a partitioned table is locked before
 * perusing its PartitionedRelPruneInfo contained in an Append node to do the
 * pruning, the result of which is used to populate the Append node's
 * PlanPrepOutput.

BTW, I've added an Assert in ExecGetRangeTableRelation():

   /*
    * A cross-check that AcquireExecutorLocks() hasn't missed any relations
    * it must not have.
    */
   Assert(estate->es_execprep == NULL ||
          bms_is_member(rti, estate->es_execprep->relationRTIs));

which IOW ensures that the actual execution of a plan only sees
relations that ExecutorPrep() would've told AcquireExecutorLocks() to
take a lock on.

> + * At least one of 'planstate' or 'econtext' must be passed to be able to
> + * successfully evaluate any non-Const expressions contained in the
> + * steps.
>
> This also seems fuzzy. If I'm thinking of calling this function, I
> don't know how I'd know whether this criterion is met.

OK, I have removed this comment (which was on top of a static local
function) in favor of adding some commentary on this in places where
it belongs.  For example, in ExecPrepDoInitialPruning():

    /*
     * We don't yet have a PlanState for the parent plan node, so must create
     * a standalone ExprContext to evaluate pruning expressions, equipped with
     * the information about the EXTERN parameters that the caller passed us.
     * Note that that's okay because the initial pruning steps do not
     * involve anything that requires the execution to have started.
     */
    econtext = CreateStandaloneExprContext();
    econtext->ecxt_param_list_info = params;
    prunestate = ExecCreatePartitionPruneState(NULL, pruneinfo,
                                               true, false,
                                               rtable, econtext,
                                               pdir, parentrelids);

> I don't love PlanPrepOutput the way you have it. I think one of the
> basic design issues for this patch is: should we think of the prep
> phase as specifically pruning, or is it general prep and pruning is
> the first thing for which we're going to use it? If it's really a
> pre-pruning phase, we could name it that way instead of calling it
> "prep". If it's really a general prep phase, then why does
> PlanPrepOutput contain initially_valid_subnodes as a field? One could
> imagine letting each prep function decide what kind of prep node it
> would like to return, with partition pruning being just one of the
> options. But is that a useful generalization of the basic concept, or
> just pretending that a special-purpose mechanism is more general than
> it really is?

While it can feel like the latter TBH, I'm inclined to keep
ExecutorPrep generalized.  What bothers me about the alternative of
calling the new phase something less generalized like
ExecutorDoInitPruning() is that it makes the somewhat elaborate API
changes needed to get the phase's output into QueryDesc, through
which it ultimately reaches the main executor, seem less worthwhile.

I agree that PlanPrepOutput design needs to be likewise generalized,
maybe like you suggest -- using PlanInitPruningOutput, a child class
of PlanPrepOutput, to return the prep output for plan nodes that
support pruning.

Thoughts?

> + return CreateQueryDesc(pstmt, NULL, /* XXX pass ExecPrepOutput too? */
>
> It seems to me that we should do what the XXX suggests. It doesn't
> seem nice if the parallel workers could theoretically decide to prune
> a different set of nodes than the leader.

OK, will fix.

> + * known at executor startup (excludeing expressions containing
>
> Extra e.
>
> + * into subplan indexes, is also returned for use during subsquent
>
> Missing e.

Will fix.

> Somewhere, we're going to need to document the idea that this may
> permit us to execute a plan that isn't actually fully valid, but that
> we expect to survive because we'll never do anything with the parts of
> it that aren't. Maybe that should be added to the executor README, or
> maybe there's some better place, but I don't think that should remain
> something that's just implicit.

Agreed.  I'd added a description of the new prep phase to executor
README, though the text didn't mention this particular bit.  Will fix
to mention it.

> This is not a full review, just some initial thoughts looking through this.

Thanks again. Will post a new version soon after a bit more polishing.

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/flat/b59605fecb20ba9ea94e70ab60098c237c870628.camel%40postgrespro.ru



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Mon, Mar 7, 2022 at 11:18 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Fri, Feb 11, 2022 at 7:02 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I don't love PlanPrepOutput the way you have it. I think one of the
> > basic design issues for this patch is: should we think of the prep
> > phase as specifically pruning, or is it general prep and pruning is
> > the first thing for which we're going to use it? If it's really a
> > pre-pruning phase, we could name it that way instead of calling it
> > "prep". If it's really a general prep phase, then why does
> > PlanPrepOutput contain initially_valid_subnodes as a field? One could
> > imagine letting each prep function decide what kind of prep node it
> > would like to return, with partition pruning being just one of the
> > options. But is that a useful generalization of the basic concept, or
> > just pretending that a special-purpose mechanism is more general than
> > it really is?
>
> While it can feel like the latter TBH, I'm inclined to keep
> ExecutorPrep generalized.  What bothers me about the alternative of
> calling the new phase something less generalized like
> ExecutorDoInitPruning() is that it makes the somewhat elaborate API
> changes needed to get the phase's output into QueryDesc, through
> which it ultimately reaches the main executor, seem less worthwhile.
>
> I agree that PlanPrepOutput design needs to be likewise generalized,
> maybe like you suggest -- using PlanInitPruningOutput, a child class
> of PlanPrepOutput, to return the prep output for plan nodes that
> support pruning.
>
> Thoughts?

So I decided to agree with you after all about limiting the scope of
this new executor interface, or IOW calling it what it is.

I have named it ExecutorGetLockRels() to go with the only use case we
know for it -- get the set of relations for AcquireExecutorLocks() to
lock to validate a plan tree.  Its result is returned in a node named
ExecLockRelsInfo, which contains the set of relations scanned in the
plan tree (lockrels) and a list of PlanInitPruningOutput nodes for all
nodes that undergo pruning.

> > + return CreateQueryDesc(pstmt, NULL, /* XXX pass ExecPrepOutput too? */
> >
> > It seems to me that we should do what the XXX suggests. It doesn't
> > seem nice if the parallel workers could theoretically decide to prune
> > a different set of nodes than the leader.
>
> OK, will fix.

Done.  This required adding nodeToString() and stringToNode() support,
which wasn't there before, for the nodes produced by the new executor
function.

> > Somewhere, we're going to need to document the idea that this may
> > permit us to execute a plan that isn't actually fully valid, but that
> > we expect to survive because we'll never do anything with the parts of
> > it that aren't. Maybe that should be added to the executor README, or
> > maybe there's some better place, but I don't think that should remain
> > something that's just implicit.
>
> Agreed.  I'd added a description of the new prep phase to executor
> README, though the text didn't mention this particular bit.  Will fix
> to mention it.

Rewrote the comments above ExecutorGetLockRels() (previously
ExecutorPrep()) and the executor README text to be explicit about the
fact that not locking some relations effectively invalidates pruned
parts of the plan tree.

> > This is not a full review, just some initial thoughts looking through this.
>
> Thanks again. Will post a new version soon after a bit more polishing.

Attached is v5, now broken into 3 patches:

0001: Some refactoring of runtime pruning code
0002: Add a plan_tree_walker
0003: Teach AcquireExecutorLocks to skip locking pruned relations

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Mar 11, 2022 at 11:35 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Attached is v5, now broken into 3 patches:
>
> 0001: Some refactoring of runtime pruning code
> 0002: Add a plan_tree_walker
> 0003: Teach AcquireExecutorLocks to skip locking pruned relations

Repeated the performance tests described in the 1st email of this thread:

HEAD: (copied from the 1st email)

32      tps = 20561.776403 (without initial connection time)
64      tps = 12553.131423 (without initial connection time)
128     tps = 13330.365696 (without initial connection time)
256     tps = 8605.723120 (without initial connection time)
512     tps = 4435.951139 (without initial connection time)
1024    tps = 2346.902973 (without initial connection time)
2048    tps = 1334.680971 (without initial connection time)

Patched v1: (copied from the 1st email)

32      tps = 27554.156077 (without initial connection time)
64      tps = 27531.161310 (without initial connection time)
128     tps = 27138.305677 (without initial connection time)
256     tps = 25825.467724 (without initial connection time)
512     tps = 19864.386305 (without initial connection time)
1024    tps = 18742.668944 (without initial connection time)
2048    tps = 16312.412704 (without initial connection time)

Patched v5:

32      tps = 28204.197738 (without initial connection time)
64      tps = 26795.385318 (without initial connection time)
128     tps = 26387.920550 (without initial connection time)
256     tps = 25601.141556 (without initial connection time)
512     tps = 19911.947502 (without initial connection time)
1024    tps = 20158.692952 (without initial connection time)
2048    tps = 16180.195463 (without initial connection time)

Good to see that these rewrites haven't really hurt the numbers much,
which makes sense because the rewrites have really been about putting
the code in the right place.

BTW, these are the numbers for the same benchmark repeated with
plan_cache_mode = auto, which causes a custom plan to be chosen for
every execution and so is unaffected by this patch.

32      tps = 13359.225082 (without initial connection time)
64      tps = 15760.533280 (without initial connection time)
128     tps = 15825.734482 (without initial connection time)
256     tps = 15017.693905 (without initial connection time)
512     tps = 13479.973395 (without initial connection time)
1024    tps = 13200.444397 (without initial connection time)
2048    tps = 12884.645475 (without initial connection time)

Comparing them to numbers when using force_generic_plan shows that
making the generic plans faster is indeed worthwhile.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Zhihong Yu
Date:
Hi,
w.r.t. v5-0003-Teach-AcquireExecutorLocks-to-skip-locking-pruned.patch :

(pruning steps containing expressions that can be computed before
before the executor proper has started)

the word 'before' was repeated.

For ExecInitParallelPlan():

+   char       *execlockrelsinfo_data;
+   char       *execlockrelsinfo_space;

the content of execlockrelsinfo_data is copied into execlockrelsinfo_space.
I wonder if having one of execlockrelsinfo_data and execlockrelsinfo_space suffices.

Cheers

Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Fri, Mar 11, 2022 at 9:35 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Attached is v5, now broken into 3 patches:
>
> 0001: Some refactoring of runtime pruning code
> 0002: Add a plan_tree_walker
> 0003: Teach AcquireExecutorLocks to skip locking pruned relations

So is any other committer planning to look at this? Tom, perhaps?
David? This strikes me as important work, and I don't mind going
through and trying to do some detailed review, but (A) I am not the
person most familiar with the code being modified here and (B) there
are some important theoretical questions about the approach that we
might want to try to cover before we get down into the details.

In my opinion, the most important theoretical issue here is around
reuse of plans that are no longer entirely valid, but the parts that
are no longer valid are certain to be pruned. If, because we know that
some parameter has some particular value, we skip locking a bunch of
partitions, then when we're executing the plan, those partitions need
not exist any more -- or they could have different indexes, be
detached from the partitioning hierarchy and subsequently altered,
whatever. That seems fine to me provided that all of our code (and any
third-party code) is careful not to rely on the portion of the plan
that we've pruned away, and doesn't assume that (for example) we can
still fetch the name of an index whose OID appears in there someplace.
I cannot think of a hazard where the fact that part of a plan is
no longer valid because some DDL has been executed "infects" the
remainder of the plan. As long as we lock the partitioned tables named
in the plan and their descendants down to the level just above the one
at which something is pruned, and are careful, I think we should be
OK. It would be nice to know if someone has a fundamentally different
view of the hazards here, though.

Just to state my position here clearly, I would be more than happy if
somebody else plans to pick this up and try to get some or all of it
committed, and will cheerfully defer to such person in the event that
they have that plan. If, however, no such person exists, I may try my
hand at that myself.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> In my opinion, the most important theoretical issue here is around
> reuse of plans that are no longer entirely valid, but the parts that
> are no longer valid are certain to be pruned. If, because we know that
> some parameter has some particular value, we skip locking a bunch of
> partitions, then when we're executing the plan, those partitions need
> not exist any more -- or they could have different indexes, be
> detached from the partitioning hierarchy and subsequently altered,
> whatever.

Check.

> That seems fine to me provided that all of our code (and any
> third-party code) is careful not to rely on the portion of the plan
> that we've pruned away, and doesn't assume that (for example) we can
> still fetch the name of an index whose OID appears in there someplace.

... like EXPLAIN, for example?

If "pruning" means physical removal from the plan tree, then it's
probably all right.  However, it looks to me like that doesn't
actually happen, or at least doesn't happen till much later, so
there's room for worry about a disconnect between what plancache.c
has verified and what executor startup will try to touch.  As you
say, in the absence of any bugs, that's not a problem ... but if
there are such bugs, tracking them down would be really hard.

What I am skeptical about is that this work actually accomplishes
anything under real-world conditions.  That's because if pruning would
save enough to make skipping the lock-acquisition phase worth the
trouble, the plan cache is almost certainly going to decide it should
be using a custom plan not a generic plan.  Now if we had a better
cost model (or, indeed, any model at all) for run-time pruning effects
then maybe that situation could be improved.  I think we'd be better
served to worry about that end of it before we spend more time making
the executor even less predictable.

Also, while I've not spent much time at all reading this patch,
it seems rather desperately undercommented, and a lot of the
new names are unintelligible.  In particular, I suspect that the
patch is significantly redesigning when/where run-time pruning
happens (unless it's just letting that be run twice); but I don't
see any documentation or name changes suggesting where that
responsibility is now.

            regards, tom lane



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Mon, Mar 14, 2022 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ... like EXPLAIN, for example?

Exactly! I think that's the foremost example, but extension modules
like auto_explain or even third-party extensions are also a risk. I
think there was some discussion of this previously.

> If "pruning" means physical removal from the plan tree, then it's
> probably all right.  However, it looks to me like that doesn't
> actually happen, or at least doesn't happen till much later, so
> there's room for worry about a disconnect between what plancache.c
> has verified and what executor startup will try to touch.  As you
> say, in the absence of any bugs, that's not a problem ... but if
> there are such bugs, tracking them down would be really hard.

Surgery on the plan would violate the general principle that plans are
read only once constructed. I think the idea ought to be to pass a
secondary data structure around with the plan that defines which parts
you must ignore. Any code that fails to use that other data structure
in the appropriate manner gets defined to be buggy and has to be fixed
by making it follow the new rules.

> What I am skeptical about is that this work actually accomplishes
> anything under real-world conditions.  That's because if pruning would
> save enough to make skipping the lock-acquisition phase worth the
> trouble, the plan cache is almost certainly going to decide it should
> be using a custom plan not a generic plan.  Now if we had a better
> cost model (or, indeed, any model at all) for run-time pruning effects
> then maybe that situation could be improved.  I think we'd be better
> served to worry about that end of it before we spend more time making
> the executor even less predictable.

I don't agree with that analysis, because setting plan_cache_mode is
not uncommon. Even if that GUC didn't exist, I'm pretty sure there are
cases where the planner naturally falls into a generic plan anyway,
even though pruning is happening. But as it is, the GUC does exist,
and people use it. Consequently, while I'd love to see something done
about the costing side of things, I do not accept that all other
improvements should wait for that to happen.

> Also, while I've not spent much time at all reading this patch,
> it seems rather desperately undercommented, and a lot of the
> new names are unintelligible.  In particular, I suspect that the
> patch is significantly redesigning when/where run-time pruning
> happens (unless it's just letting that be run twice); but I don't
> see any documentation or name changes suggesting where that
> responsibility is now.

I am sympathetic to that concern. I spent a while staring at a
baffling comment in 0001 only to discover it had just been moved from
elsewhere. I really don't feel that things in this are as clear as
they could be -- although I hasten to add that I respect the people
who have done work in this area previously and am grateful for what
they did. It's been a huge benefit to the project in spite of the
bumps in the road. Moreover, this isn't the only code in PostgreSQL
that needs improvement, or the worst. That said, I do think there are
problems. I don't yet have a position on whether this patch is making
that better or worse.

That said, I believe that the core idea of the patch is to optionally
perform pruning before we acquire locks or spin up the main executor
and then remember the decisions we made. If once the main executor is
spun up we already made those decisions, then we must stick with what
we decided. If not, we make those pruning decisions at the same point
we do currently - more or less on demand, at the point when we'd need
to know whether to descend that branch of the plan tree or not. I
think this scheme comes about because there are a couple of different
interfaces to the parameterized query stuff, and in some code paths we
have the values early enough to use them for pre-pruning, and in
others we don't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Mar 15, 2022 at 5:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 14, 2022 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > What I am skeptical about is that this work actually accomplishes
> > anything under real-world conditions.  That's because if pruning would
> > save enough to make skipping the lock-acquisition phase worth the
> > trouble, the plan cache is almost certainly going to decide it should
> > be using a custom plan not a generic plan.  Now if we had a better
> > cost model (or, indeed, any model at all) for run-time pruning effects
> > then maybe that situation could be improved.  I think we'd be better
> > served to worry about that end of it before we spend more time making
> > the executor even less predictable.
>
> I don't agree with that analysis, because setting plan_cache_mode is
> not uncommon. Even if that GUC didn't exist, I'm pretty sure there are
> cases where the planner naturally falls into a generic plan anyway,
> even though pruning is happening. But as it is, the GUC does exist,
> and people use it. Consequently, while I'd love to see something done
> about the costing side of things, I do not accept that all other
> improvements should wait for that to happen.

I agree that making generic plans execute faster has merit even before
we make the costing changes to allow plancache.c to prefer generic plans
over custom ones in these cases.  As the numbers in my previous email
show, simply executing a generic plan with the proposed improvements
applied is significantly cheaper than having the planner do the
pruning on every execution:

nparts      auto/custom     generic
======      ==========      ======
32          13359           28204
64          15760           26795
128         15825           26387
256         15017           25601
512         13479           19911
1024        13200           20158
2048        12884           16180

> > Also, while I've not spent much time at all reading this patch,
> > it seems rather desperately undercommented, and a lot of the
> > new names are unintelligible.  In particular, I suspect that the
> > patch is significantly redesigning when/where run-time pruning
> > happens (unless it's just letting that be run twice); but I don't
> > see any documentation or name changes suggesting where that
> > responsibility is now.
>
> I am sympathetic to that concern. I spent a while staring at a
> baffling comment in 0001 only to discover it had just been moved from
> elsewhere. I really don't feel that things in this are as clear as
> they could be -- although I hasten to add that I respect the people
> who have done work in this area previously and am grateful for what
> they did. It's been a huge benefit to the project in spite of the
> bumps in the road. Moreover, this isn't the only code in PostgreSQL
> that needs improvement, or the worst. That said, I do think there are
> problems. I don't yet have a position on whether this patch is making
> that better or worse.

Okay, I'd like to post a new version with the comments edited to make
them a bit more intelligible.  I understand that the comments around
the new invocation mode(s) of runtime pruning are not as clear as they
should be, especially as the changes that this patch wants to make to
how things work are not very localized.

> That said, I believe that the core idea of the patch is to optionally
> perform pruning before we acquire locks or spin up the main executor
> and then remember the decisions we made. If once the main executor is
> spun up we already made those decisions, then we must stick with what
> we decided. If not, we make those pruning decisions at the same point
> we do currently

Right.  The "initial" pruning, that this patch wants to make occur at
an earlier point (plancache.c), is currently performed in
ExecInit[Merge]Append().

If it does occur early due to the plan being a cached one,
ExecInit[Merge]Append() simply refers to its result that would be made
available via a new data structure that plancache.c has been made to
pass down to the executor alongside the plan tree.

If it does not, ExecInit[Merge]Append() does the pruning in the same
way it does now.  Such cases include plans whose initial pruning steps
use only STABLE expressions (which the planner doesn't compute by
itself, lest the resulting plan be cached) and no EXTERN parameters.
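
Roughly, ExecInitAppend() then ends up with a choice like this (the name
and shape of the field holding the pre-computed results are only
illustrative, not necessarily what the patch uses):

    /* Has plancache.c already performed the initial pruning for us? */
    if (estate->es_init_prune_results != NULL &&
        estate->es_init_prune_results[node->plan.plan_node_id] != NULL)
    {
        /* reuse the result computed before the plan tree was locked */
        validsubplans =
            estate->es_init_prune_results[node->plan.plan_node_id];
    }
    else
    {
        /* no pre-computed result, so do the initial pruning here as before */
        validsubplans =
            ExecFindInitialMatchingSubPlans(prunestate,
                                            list_length(node->appendplans));
    }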

--
Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Mar 15, 2022 at 3:19 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Mar 15, 2022 at 5:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Mon, Mar 14, 2022 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Also, while I've not spent much time at all reading this patch,
> > > it seems rather desperately undercommented, and a lot of the
> > > new names are unintelligible.  In particular, I suspect that the
> > > patch is significantly redesigning when/where run-time pruning
> > > happens (unless it's just letting that be run twice); but I don't
> > > see any documentation or name changes suggesting where that
> > > responsibility is now.
> >
> > I am sympathetic to that concern. I spent a while staring at a
> > baffling comment in 0001 only to discover it had just been moved from
> > elsewhere. I really don't feel that things in this are as clear as
> > they could be -- although I hasten to add that I respect the people
> > who have done work in this area previously and am grateful for what
> > they did. It's been a huge benefit to the project in spite of the
> > bumps in the road. Moreover, this isn't the only code in PostgreSQL
> > that needs improvement, or the worst. That said, I do think there are
> > problems. I don't yet have a position on whether this patch is making
> > that better or worse.
>
> Okay, I'd like to post a new version with the comments edited to make
> them a bit more intelligible.  I understand that the comments around
> the new invocation mode(s) of runtime pruning are not as clear as they
> should be, especially as the changes that this patch wants to make to
> how things work are not very localized.

Actually, another area where the comments may not be as clear as they
should have been is the changes that the patch makes to the
AcquireExecutorLocks() logic that decides which relations are locked
to safeguard the plan tree for execution, which are those given by
RTE_RELATION entries in the range table.

Without the patch, they are found by actually scanning the range table.

With the patch, it's the same set of RTEs if the plan doesn't contain
any pruning nodes, though instead of the range table, what is scanned
is a bitmapset of their RT indexes that is made available by the
planner in the form of PlannedStmt.lockrels.  When the plan does
contain a pruning node (PlannedStmt.containsInitialPruning), the
bitmapset is constructed by calling ExecutorGetLockRels() on the plan
tree, which walks it to add RT indexes of relations mentioned in the
Scan nodes, while skipping any nodes that are pruned after performing
initial pruning steps that may be present in their containing parent
node's PartitionPruneInfo.  Also, the RT indexes of partitioned tables
that are present in the PartitionPruneInfo itself are also added to
the set.

While expanding comments added by the patch to make this clear, I
realized that there are two problems, one of them quite glaring:

* Planner's constructing this bitmapset and its copying along with the
PlannedStmt is pure overhead in the cases that this patch has nothing
to do with, which is the kind of thing that Andres cautioned against
upthread.

* Not all partitioned tables that would have been locked without the
patch to come up with an Append/MergeAppend plan may be returned by
ExecutorGetLockRels().  For example, if none of the query's
runtime-prunable quals were found to match the partition key of an
intermediate partitioned table, that partitioned table is not
included in the PartitionPruneInfo.  Or an Append/MergeAppend
covering a partition tree may not contain any PartitionPruneInfo to
begin with, in which case only the leaf partitions and none of the
partitioned parents would be accounted for by the
ExecutorGetLockRels() logic.

The 1st one seems easy to fix by not inventing PlannedStmt.lockrels
and just doing what's being done now: scan the range table if
(!PlannedStmt.containsInitialPruning).

The only way perhaps to fix the second one is to reconsider the
decision we made in the following commit:

    commit 52ed730d511b7b1147f2851a7295ef1fb5273776
    Author: Tom Lane <tgl@sss.pgh.pa.us>
    Date:   Sun Oct 7 14:33:17 2018 -0400

    Remove some unnecessary fields from Plan trees.

    In the wake of commit f2343653f, we no longer need some fields that
    were used before to control executor lock acquisitions:

    * PlannedStmt.nonleafResultRelations can go away entirely.

    * partitioned_rels can go away from Append, MergeAppend, and ModifyTable.
    However, ModifyTable still needs to know the RT index of the partition
    root table if any, which was formerly kept in the first entry of that
    list.  Add a new field "rootRelation" to remember that.  rootRelation is
    partly redundant with nominalRelation, in that if it's set it will have
    the same value as nominalRelation.  However, the latter field has a
    different purpose so it seems best to keep them distinct.

That is, add back the partitioned_rels field, at least to Append and
MergeAppend, to store the RT indexes of partitioned tables whose
children's paths are present in Append/MergeAppend.subpaths.

Thoughts?


--
Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Mar 22, 2022 at 9:44 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Mar 15, 2022 at 3:19 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Tue, Mar 15, 2022 at 5:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > On Mon, Mar 14, 2022 at 3:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > > Also, while I've not spent much time at all reading this patch,
> > > > it seems rather desperately undercommented, and a lot of the
> > > > new names are unintelligible.  In particular, I suspect that the
> > > > patch is significantly redesigning when/where run-time pruning
> > > > happens (unless it's just letting that be run twice); but I don't
> > > > see any documentation or name changes suggesting where that
> > > > responsibility is now.
> > >
> > > I am sympathetic to that concern. I spent a while staring at a
> > > baffling comment in 0001 only to discover it had just been moved from
> > > elsewhere. I really don't feel that things in this are as clear as
> > > they could be -- although I hasten to add that I respect the people
> > > who have done work in this area previously and am grateful for what
> > > they did. It's been a huge benefit to the project in spite of the
> > > bumps in the road. Moreover, this isn't the only code in PostgreSQL
> > > that needs improvement, or the worst. That said, I do think there are
> > > problems. I don't yet have a position on whether this patch is making
> > > that better or worse.
> >
> > Okay, I'd like to post a new version with the comments edited to make
> > them a bit more intelligible.  I understand that the comments around
> > the new invocation mode(s) of runtime pruning are not as clear as they
> > should be, especially as the changes that this patch wants to make to
> > how things work are not very localized.
>
> Actually, another area where the comments may not be as clear as they
> should have been is the changes that the patch makes to the
> AcquireExecutorLocks() logic that decides which relations are locked
> to safeguard the plan tree for execution, which are those given by
> RTE_RELATION entries in the range table.
>
> Without the patch, they are found by actually scanning the range table.
>
> With the patch, it's the same set of RTEs if the plan doesn't contain
> any pruning nodes, though instead of the range table, what is scanned
> is a bitmapset of their RT indexes that is made available by the
> planner in the form of PlannedStmt.lockrels.  When the plan does
> contain a pruning node (PlannedStmt.containsInitialPruning), the
> bitmapset is constructed by calling ExecutorGetLockRels() on the plan
> tree, which walks it to add RT indexes of relations mentioned in the
> Scan nodes, while skipping any nodes that are pruned after performing
> initial pruning steps that may be present in their containing parent
> node's PartitionPruneInfo.  Also, the RT indexes of partitioned tables
> that are present in the PartitionPruneInfo itself are also added to
> the set.
>
> While expanding comments added by the patch to make this clear, I
> realized that there are two problems, one of them quite glaring:
>
> * Planner's constructing this bitmapset and its copying along with the
> PlannedStmt is pure overhead in the cases that this patch has nothing
> to do with, which is the kind of thing that Andres cautioned against
> upthread.
>
> * Not all partitioned tables that would have been locked without the
> patch to come up with an Append/MergeAppend plan may be returned by
> ExecutorGetLockRels().  For example, if none of the query's
> runtime-prunable quals were found to match the partition key of an
> intermediate partitioned table, that partitioned table is not
> included in the PartitionPruneInfo.  Or an Append/MergeAppend
> covering a partition tree may not contain any PartitionPruneInfo to
> begin with, in which case only the leaf partitions and none of the
> partitioned parents would be accounted for by the
> ExecutorGetLockRels() logic.
>
> The 1st one seems easy to fix by not inventing PlannedStmt.lockrels
> and just doing what's being done now: scan the range table if
> (!PlannedStmt.containsInitialPruning).

The attached updated patch does it like this.

> The only way perhaps to fix the second one is to reconsider the
> decision we made in the following commit:
>
>     commit 52ed730d511b7b1147f2851a7295ef1fb5273776
>     Author: Tom Lane <tgl@sss.pgh.pa.us>
>     Date:   Sun Oct 7 14:33:17 2018 -0400
>
>     Remove some unnecessary fields from Plan trees.
>
>     In the wake of commit f2343653f, we no longer need some fields that
>     were used before to control executor lock acquisitions:
>
>     * PlannedStmt.nonleafResultRelations can go away entirely.
>
>     * partitioned_rels can go away from Append, MergeAppend, and ModifyTable.
>     However, ModifyTable still needs to know the RT index of the partition
>     root table if any, which was formerly kept in the first entry of that
>     list.  Add a new field "rootRelation" to remember that.  rootRelation is
>     partly redundant with nominalRelation, in that if it's set it will have
>     the same value as nominalRelation.  However, the latter field has a
>     different purpose so it seems best to keep them distinct.
>
> That is, add back the partitioned_rels field, at least to Append and
> MergeAppend, to store the RT indexes of partitioned tables whose
> children's paths are present in Append/MergeAppend.subpaths.

And I implemented this in the attached 0002, which reintroduces
partitioned_rels in Append/MergeAppend nodes as a bitmapset of RT
indexes.  The set contains the RT indexes of partitioned ancestors
whose expansion produced the leaf partitions that a given
Append/MergeAppend node scans.  This project needs a way of knowing
the partitioned tables involved in producing an Append/MergeAppend
node, because we'd like to give plancache.c the ability to glean the
set of relations to be locked by scanning the plan tree (while making
it ready for execution) rather than by scanning the range table, and
the only relations currently missing from the tree are partitioned
tables.

One fly-in-the-ointment situation I faced when doing that is the fact
that setrefs.c in most situations removes the Append/MergeAppend from
the final plan if it contains only one child subplan.  I got around it
by inventing a PlannerGlobal/PlannedStmt.elidedAppendPartedRels set
which is a union of partitioned_rels of all the Append/MergeAppend
nodes in the plan tree that were removed as described.

Other than the changes mentioned above, the updated patch now contains
a bit more commentary than earlier versions, mostly around
AcquireExecutorLocks()'s new way of determining the set of relations
to lock and the significantly redesigned working of the "initial"
execution pruning.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Mon, Mar 28, 2022 at 4:17 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Other than the changes mentioned above, the updated patch now contains
> a bit more commentary than earlier versions, mostly around
> AcquireExecutorLocks()'s new way of determining the set of relations
> to lock and the significantly redesigned working of the "initial"
> execution pruning.

Forgot to rebase over the latest HEAD, so here's v7.  Also fixed that
_out and _read functions for PlanInitPruningOutput were using an
obsolete node label.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Mon, Mar 28, 2022 at 4:28 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Mon, Mar 28, 2022 at 4:17 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > Other than the changes mentioned above, the updated patch now contains
> > a bit more commentary than earlier versions, mostly around
> > AcquireExecutorLocks()'s new way of determining the set of relations
> > to lock and the significantly redesigned working of the "initial"
> > execution pruning.
>
> Forgot to rebase over the latest HEAD, so here's v7.  Also fixed that
> _out and _read functions for PlanInitPruningOutput were using an
> obsolete node label.

Rebased.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Alvaro Herrera
Date:
I'm looking at 0001 here with intention to commit later.  I see that
there is some resistance to 0004, but I think a final verdict on that
one doesn't materially affect 0001.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El destino baraja y nosotros jugamos" (A. Schopenhauer)



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Mar 31, 2022 at 6:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I'm looking at 0001 here with intention to commit later.  I see that
> there is some resistance to 0004, but I think a final verdict on that
> one doesn't materially affect 0001.

Thanks.

While the main goal of the refactoring patch is to make it easier to
review the more complex changes that 0004 makes to execPartition.c, I
agree it has merit on its own.  Although, one may say that the bit
about providing a PlanState-independent ExprContext is more closely
tied with 0004's requirements...

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
David Rowley
Date:
On Thu, 31 Mar 2022 at 16:25, Amit Langote <amitlangote09@gmail.com> wrote:
> Rebased.

I've been looking over the v8 patch and I'd like to propose semi-baked
ideas to improve things.  I'd need to go and write them myself to
fully know if they'd actually work ok.

1. You've changed the signature of various functions by adding
ExecLockRelsInfo *execlockrelsinfo.  I'm wondering why you didn't just
put the ExecLockRelsInfo as a new field in PlannedStmt?

I think the above gets around messing with the signatures of
CreateQueryDesc(), ExplainOnePlan(), pg_plan_queries(),
PortalDefineQuery() and ProcessQuery().  It would get rid of your change
of foreach to forboth in execute_sql_string() / PortalRunMulti() and
gets rid of a number of places where you're carrying around a variable
named execlockrelsinfo_list. It would also make the patch significantly
easier to review as you'd be touching far fewer files.

2. I don't really like the way you've gone about most of the patch...

The way I imagine this working is that during create_plan() we visit
all nodes that have run-time pruning, then inside create_append_plan()
and create_merge_append_plan() we'd tag those onto a new field in
PlannerGlobal.  That way you can store the PartitionPruneInfos in the
new PlannedStmt field in standard_planner() after the
makeNode(PlannedStmt).

Instead of storing the PartitionPruneInfo in the Append / MergeAppend
struct, you'd just add a new index field to those structs. The index
would start with 0 for the 0th PartitionPruneInfo. You'd basically
just know the index by assigning
list_length(root->glob->partitionpruneinfos).

You'd then assign the root->glob->partitionpruneinfos to
PlannedStmt.partitionpruneinfos and anytime you needed to do run-time
pruning during execution, you'd need to use the Append / MergeAppend's
partition_prune_info_idx to lookup the PartitionPruneInfo in some new
field you add to EState to store those.  You'd leave that index as -1
if there's no PartitionPruneInfo for the Append / MergeAppend node.

When you do AcquireExecutorLocks(), you'd iterate over the
PlannedStmt's PartitionPruneInfo to figure out which subplans to
prune. You'd then have an array sized
list_length(plannedstmt->runtimepruneinfos) where you'd store the
result.  When the Append/MergeAppend node starts up you just check if
the part_prune_info_idx >= 0 and if there's a non-NULL result stored
then use that result.  That's how you'd ensure you always got the same
run-time prune result between locking and plan startup.
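
To put that in data-structure terms, I'm imagining roughly the
following (exact names not important):

    /* in PlannedStmt */
    List       *partPruneInfos;     /* all PartitionPruneInfos in the plan */

    /* in Append / MergeAppend, replacing the embedded PartitionPruneInfo */
    int         part_prune_info_idx;    /* index into
                                         * PlannedStmt.partPruneInfos, or -1 */

    /* in EState */
    Bitmapset **es_part_prune_results;  /* per-PartitionPruneInfo results of
                                         * the initial pruning performed
                                         * during AcquireExecutorLocks(),
                                         * with NULL entries if none */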

3. Also, looking at ExecGetLockRels(), shouldn't it be the planner's
job to determine the minimum set of relations which must be locked?  I
think the plan tree traversal during execution is not great.  Seems the
whole point of this patch is to reduce overhead during execution. A
full additional plan traversal aside from the 3 that we already do for
start/run/end of execution seems not great.

I think this means that during AcquireExecutorLocks() you'd start with
the minimum set of RTEs that need to be locked as determined during
create_plan() and stored in some Bitmapset field in PlannedStmt.  This
minimal set would exclude only those RTIs that might turn out to be
unused because of a PartitionPruneInfo with initial pruning steps, i.e.
it would include RTIs from any PartitionPruneInfo with no init pruning
steps (you can't skip any locks for those).  All you need to do to
determine the RTEs to lock is take the minimal set and execute each
PartitionPruneInfo in the PlannedStmt that has init steps.
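
In pseudo-code, something like this (the PlannedStmt field names, the
two helpers and the params variable are all made up):

    Bitmapset  *lockrtis = bms_copy(plannedstmt->minLockRelids);
    ListCell   *lc;

    foreach(lc, plannedstmt->partPruneInfos)
    {
        PartitionPruneInfo *pruneinfo = lfirst_node(PartitionPruneInfo, lc);

        if (pruneinfo_has_init_steps(pruneinfo))
            lockrtis = bms_add_members(lockrtis,
                                       rtis_surviving_init_pruning(pruneinfo,
                                                                   params));
    }
    /* ... then lock every relation whose RT index is in lockrtis ... */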

4. It's a bit disappointing to see RelOptInfo.partitioned_rels getting
revived here.  Why not just add a partitioned_relids field to
PartitionPruneInfo and have make_partitionedrel_pruneinfo build you a
Relids of them?  PartitionedRelPruneInfo already has an rtindex field,
so you just need to bms_add_member whatever that rtindex is.
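
Roughly, collecting them is just (using the existing prune_infos
list-of-lists structure; where exactly the loop lives doesn't matter
much):

    Bitmapset  *partitioned_relids = NULL;
    ListCell   *l1;

    foreach(l1, pruneinfo->prune_infos)
    {
        ListCell   *l2;

        foreach(l2, (List *) lfirst(l1))
        {
            PartitionedRelPruneInfo *pinfo =
                lfirst_node(PartitionedRelPruneInfo, l2);

            partitioned_relids = bms_add_member(partitioned_relids,
                                                pinfo->rtindex);
        }
    }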

It's a fairly high-level review at this stage. I can look in more
detail if the above points get looked at.  You may find or know of
some reason why it can't be done like I mention above.

David



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
Thanks a lot for looking into this.

On Fri, Apr 1, 2022 at 10:32 AM David Rowley <dgrowleyml@gmail.com> wrote:
> I've been looking over the v8 patch and I'd like to propose semi-baked
> ideas to improve things.  I'd need to go and write them myself to
> fully know if they'd actually work ok.
>
> 1. You've changed the signature of various functions by adding
> ExecLockRelsInfo *execlockrelsinfo.  I'm wondering why you didn't just
> put the ExecLockRelsInfo as a new field in PlannedStmt?
>
> I think the above gets around messing the signatures of
> CreateQueryDesc(), ExplainOnePlan(), pg_plan_queries(),
> PortalDefineQuery(), ProcessQuery() It would get rid of your change of
> foreach to forboth in execute_sql_string() / PortalRunMulti() and gets
> rid of a number of places where your carrying around a variable named
> execlockrelsinfo_list. It would also make the patch significantly
> easier to review as you'd be touching far fewer files.

I'm worried about that churn myself and did consider this idea, though
I couldn't shake the feeling that it's maybe wrong to put something in
PlannedStmt that the planner itself doesn't produce.  I mean the
definition of PlannedStmt says this:

/* ----------------
 *      PlannedStmt node
 *
 * The output of the planner

With the ideas that you've outlined below, perhaps we can frame most
of the things that the patch wants to do as the planner and the
plancache changes.  If we twist the above definition a bit to say what
the plancache does in this regard is part of planning, maybe it makes
sense to add the initial pruning related fields (nodes, outputs) into
PlannedStmt.

> 2. I don't really like the way you've gone about most of the patch...
>
> The way I imagine this working is that during create_plan() we visit
> all nodes that have run-time pruning then inside create_append_plan()
> and create_merge_append_plan() we'd tag those onto a new field in
> PlannerGlobal.  That way you can store the PartitionPruneInfos in the
> new PlannedStmt field in standard_planner() after the
> makeNode(PlannedStmt).
>
> Instead of storing the PartitionPruneInfo in the Append / MergeAppend
> struct, you'd just add a new index field to those structs. The index
> would start with 0 for the 0th PartitionPruneInfo. You'd basically
> just know the index by assigning
> list_length(root->glob->partitionpruneinfos).
>
> You'd then assign the root->glob->partitionpruneinfos to
> PlannedStmt.partitionpruneinfos and anytime you needed to do run-time
> pruning during execution, you'd need to use the Append / MergeAppend's
> partition_prune_info_idx to lookup the PartitionPruneInfo in some new
> field you add to EState to store those.  You'd leave that index as -1
> if there's no PartitionPruneInfo for the Append / MergeAppend node.
>
> When you do AcquireExecutorLocks(), you'd iterate over the
> PlannedStmt's PartitionPruneInfo to figure out which subplans to
> prune. You'd then have an array sized
> list_length(plannedstmt->runtimepruneinfos) where you'd store the
> result.  When the Append/MergeAppend node starts up you just check if
> the part_prune_info_idx >= 0 and if there's a non-NULL result stored
> then use that result.  That's how you'd ensure you always got the same
> run-time prune result between locking and plan startup.

Actually, Robert too suggested such an idea to me off-list and I think
it's worth trying.  I was not sure about the implementation, because
then we'd be passing around lists of initial pruning nodes/results
across many function/module boundaries that you mentioned in your
comment 1, but if we agree that PlannedStmt is an acceptable place for
those things to be stored, then I agree it's an attractive idea.

> 3. Also, looking at ExecGetLockRels(), shouldn't it be the planner's
> job to determine the minimum set of relations which must be locked?  I
> think the plan tree traversal during execution is not great.  Seems the
> whole point of this patch is to reduce overhead during execution. A
> full additional plan traversal aside from the 3 that we already do for
> start/run/end of execution seems not great.
>
> I think this means that during AcquireExecutorLocks() you'd start with
> the minimum set of RTEs that need to be locked as determined during
> create_plan() and stored in some Bitmapset field in PlannedStmt.

The patch did have a PlannedStmt.lockrels till v6.  Though, it wasn't
quite the same thing as what you are describing...

> This
> minimal set would also only exclude RTIs that would only possibly be
> used due to a PartitionPruneInfo with initial pruning steps, i.e.
> include RTIs from PartitionPruneInfo with no init pruning steps (you
> can't skip any locks for those).  All you need to do to determine the
> RTEs to lock is to take the minimal set and execute each
> PartitionPruneInfo in the PlannedStmt that has init steps.

So just thinking about an Append/MergeAppend, the minimum set must
include the RT indexes of all the partitioned tables whose direct and
indirect children's plans will be in 'subplans' and also of the
children if the PartitionPruneInfo doesn't contain initial steps or if
there is no PartitionPruneInfo to begin with.

One question is whether the planner should always pay the overhead of
initializing this bitmapset?  I mean it's only worthwhile if
AcquireExecutorLocks() is going to be involved, that is, the plan will
be cached and reused.

> 4. It's a bit disappointing to see RelOptInfo.partitioned_rels getting
> revived here.  Why don't you just add a partitioned_relids to
> PartitionPruneInfo and just have make_partitionedrel_pruneinfo build
> you a Relids of them. PartitionedRelPruneInfo already has an rtindex
> field, so you just need to bms_add_member whatever that rtindex is.

Hmm, not all Append/MergeAppend nodes in the plan tree may have
make_partition_pruneinfo() called on them though.

If not the proposed RelOptInfo.partitioned_rels that is populated in
the early planning stages, the only reliable way to get all the
partitioned tables involved in Appends/MergeAppends at create_plan()
stage seems to be to make a function out of the stanza at the top of
make_partition_pruneinfo() that collects them by scanning the leaf
paths and tracing each path's relation's parents up to the root
partitioned parent, and to call it from create_{merge_}append_plan()
if make_partition_pruneinfo() was not called.  I did try to implement
that and found it a bit complex and expensive (the part that scans the
leaf paths).

> It's a fairly high-level review at this stage. I can look in more
> detail if the above points get looked at.  You may find or know of
> some reason why it can't be done like I mention above.

I'll try to write a version with the above points addressed, while
keeping RelOptInfo.partitioned_rels around for now.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CA%2BHiwqH9-fAvpG-w9qYCcDWzK3vGPCMyw4f9nHzqkxXVuD1pxw%40mail.gmail.com



Re: generic plans and "initial" pruning

From
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
> On Fri, Apr 1, 2022 at 10:32 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> 1. You've changed the signature of various functions by adding
>> ExecLockRelsInfo *execlockrelsinfo.  I'm wondering why you didn't just
>> put the ExecLockRelsInfo as a new field in PlannedStmt?

> I'm worried about that churn myself and did consider this idea, though
> I couldn't shake the feeling that it's maybe wrong to put something in
> PlannedStmt that the planner itself doesn't produce.

PlannedStmt is part of the plan tree, which MUST be read-only to
the executor.  This is not negotiable.  However, there's other
places that this data could be put, such as QueryDesc.
Or for that matter, couldn't the data structure be created by
the planner?  (It looks like David is proposing exactly that
further down.)

            regards, tom lane



Re: generic plans and "initial" pruning

From
David Rowley
Date:
On Fri, 1 Apr 2022 at 16:09, Amit Langote <amitlangote09@gmail.com> wrote:
> definition of PlannedStmt says this:
>
> /* ----------------
>  *      PlannedStmt node
>  *
>  * The output of the planner
>
> With the ideas that you've outlined below, perhaps we can frame most
> of the things that the patch wants to do as the planner and the
> plancache changes.  If we twist the above definition a bit to say what
> the plancache does in this regard is part of planning, maybe it makes
> sense to add the initial pruning related fields (nodes, outputs) into
> PlannedStmt.

How about the PartitionPruneInfos go into PlannedStmt as a List
indexed in the way I mentioned and the cache of the results of pruning
in EState?

I think that leaves you adding  List *partpruneinfos,  Bitmapset
*minimumlockrtis to PlannedStmt and the thing you have to cache the
pruning results into EState.   I'm not very clear on where you should
stash the results of run-time pruning in the meantime before you can
put them in EState.  You might need to invent some intermediate struct
that gets passed around that you can scribble down some details you're
going to need during execution.

> One question is whether the planner should always pay the overhead of
> initializing this bitmapset?  I mean it's only worthwhile if
> AcquireExecutorLocks() is going to be involved, that is, the plan will
> be cached and reused.

Maybe the Bitmapset for the minimal locks needs to be built with
bms_add_range(NULL, 0, list_length(rtable));  then do
bms_del_members() on the relevant RTIs you find in the listed
PartitionPruneInfos.  That way it's very simple and cheap to do when
there are no PartitionPruneInfos.
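
A minimal sketch of that construction, using the minimumlockrtis name
suggested above (the PartitionPruneInfo fields shown here,
contains_init_steps and prunable_relids, are illustrative only):

    /* Assume everything needs locking to begin with; RT indexes are 1-based */
    Relids      minimumlockrtis = bms_add_range(NULL, 1, list_length(rtable));
    ListCell   *lc;

    /* Drop only the RTIs that initial pruning might let us skip */
    foreach(lc, partpruneinfos)
    {
        PartitionPruneInfo *pruneinfo = lfirst_node(PartitionPruneInfo, lc);

        if (pruneinfo->contains_init_steps)
            minimumlockrtis = bms_del_members(minimumlockrtis,
                                              pruneinfo->prunable_relids);
    }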

> > 4. It's a bit disappointing to see RelOptInfo.partitioned_rels getting
> > revived here.  Why don't you just add a partitioned_relids to
> > PartitionPruneInfo and just have make_partitionedrel_pruneinfo build
> > you a Relids of them. PartitionedRelPruneInfo already has an rtindex
> > field, so you just need to bms_add_member whatever that rtindex is.
>
> Hmm, not all Append/MergeAppend nodes in the plan tree may have
> make_partition_pruneinfo() called on them though.

For Append/MergeAppends without run-time pruning you'll want to add
the RTIs to the minimal locking set of RTIs to go into PlannedStmt.
The only things you want to leave out of that are RTIs for the RTEs
that you might run-time prune away during AcquireExecutorLocks().

David



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Apr 1, 2022 at 1:08 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Fri, 1 Apr 2022 at 16:09, Amit Langote <amitlangote09@gmail.com> wrote:
> > definition of PlannedStmt says this:
> >
> > /* ----------------
> >  *      PlannedStmt node
> >  *
> >  * The output of the planner
> >
> > With the ideas that you've outlined below, perhaps we can frame most
> > of the things that the patch wants to do as the planner and the
> > plancache changes.  If we twist the above definition a bit to say what
> > the plancache does in this regard is part of planning, maybe it makes
> > sense to add the initial pruning related fields (nodes, outputs) into
> > PlannedStmt.
>
> How about the PartitionPruneInfos go into PlannedStmt as a List
> indexed in the way I mentioned and the cache of the results of pruning
> in EState?
>
> I think that leaves you adding  List *partpruneinfos,  Bitmapset
> *minimumlockrtis to PlannedStmt and the thing you have to cache the
> pruning results into EState.   I'm not very clear on where you should
> stash the results of run-time pruning in the meantime before you can
> put them in EState.  You might need to invent some intermediate struct
> that gets passed around that you can scribble down some details you're
> going to need during execution.

Yes, the ExecLockRelsInfo node in the current patch, that first gets
added to the QueryDesc and subsequently to the EState of the query,
serves as that stashing place.  Not sure if you've looked at
ExecLockRelsInfo in detail in your review of the patch so far, but it
carries the initial pruning result in what are called
PlanInitPruningOutput nodes, which are stored in a list in
ExecLockRelsInfo and their offsets in the list are in turn stored in
an adjacent array that contains an element for every plan node in the
tree.  If we go with a PlannedStmt.partpruneinfos list, then maybe we
don't need to have that array, because the Append/MergeAppend nodes
would be carrying those offsets by themselves.

Maybe a different name for ExecLockRelsInfo would be better?

Also, given Tom's apparent dislike for carrying that in PlannedStmt,
maybe the way I have it now is fine?

> > One question is whether the planner should always pay the overhead of
> > initializing this bitmapset?  I mean it's only worthwhile if
> > AcquireExecutorLocks() is going to be involved, that is, the plan will
> > be cached and reused.
>
> Maybe the Bitmapset for the minimal locks needs to be built with
> bms_add_range(NULL, 0, list_length(rtable));  then do
> bms_del_members() on the relevant RTIs you find in the listed
> PartitionPruneInfos.  That way it's very simple and cheap to do when
> there are no PartitionPruneInfos.

Ah, okay.  Looking at make_partition_pruneinfo(), I think I see a way
to delete the RTIs of prunable relations -- construct an
all_matched_leaf_part_relids in parallel to allmatchedsubplans and
delete those from the initial set.

> > > 4. It's a bit disappointing to see RelOptInfo.partitioned_rels getting
> > > revived here.  Why don't you just add a partitioned_relids to
> > > PartitionPruneInfo and just have make_partitionedrel_pruneinfo build
> > > you a Relids of them. PartitionedRelPruneInfo already has an rtindex
> > > field, so you just need to bms_add_member whatever that rtindex is.
> >
> > Hmm, not all Append/MergeAppend nodes in the plan tree may have
> > make_partition_pruneinfo() called on them though.
>
> For Append/MergeAppends without run-time pruning you'll want to add
> the RTIs to the minimal locking set of RTIs to go into PlannedStmt.
> The only things you want to leave out of that are RTIs for the RTEs
> that you might run-time prune away during AcquireExecutorLocks().

Yeah, I see it now.

Thanks.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Apr 1, 2022 at 12:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > On Fri, Apr 1, 2022 at 10:32 AM David Rowley <dgrowleyml@gmail.com> wrote:
> >> 1. You've changed the signature of various functions by adding
> >> ExecLockRelsInfo *execlockrelsinfo.  I'm wondering why you didn't just
> >> put the ExecLockRelsInfo as a new field in PlannedStmt?
>
> > I'm worried about that churn myself and did consider this idea, though
> > I couldn't shake the feeling that it's maybe wrong to put something in
> > PlannedStmt that the planner itself doesn't produce.
>
> PlannedStmt is part of the plan tree, which MUST be read-only to
> the executor.  This is not negotiable.  However, there's other
> places that this data could be put, such as QueryDesc.
> Or for that matter, couldn't the data structure be created by
> the planner?  (It looks like David is proposing exactly that
> further down.)

The data structure in question is for storing the results of
performing initial partition pruning on a generic plan, which the
patch proposes to do in plancache.c -- inside the body of
AcquireExecutorLocks()'s loop over PlannedStmts -- so, it's hard to
see it as a product of the planner. :-(

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
David Rowley
Date:
On Fri, 1 Apr 2022 at 19:58, Amit Langote <amitlangote09@gmail.com> wrote:
> Yes, the ExecLockRelsInfo node in the current patch, that first gets
> added to the QueryDesc and subsequently to the EState of the query,
> serves as that stashing place.  Not sure if you've looked at
> ExecLockRelsInfo in detail in your review of the patch so far, but it
> carries the initial pruning result in what are called
> PlanInitPruningOutput nodes, which are stored in a list in
> ExecLockRelsInfo and their offsets in the list are in turn stored in
> an adjacent array that contains an element for every plan node in the
> tree.  If we go with a PlannedStmt.partpruneinfos list, then maybe we
> don't need to have that array, because the Append/MergeAppend nodes
> would be carrying those offsets by themselves.

I saw it, just not in great detail. I saw that you had an array that
was indexed by the plan node's ID.  I thought that wouldn't be so good
with large complex plans that we often get with partitioning
workloads.  That's why I mentioned using another index that you store
in Append/MergeAppend that starts at 0 and increments by 1 for each
node that has a PartitionPruneInfo made for it during create_plan.

> Maybe a different name for ExecLockRelsInfo would be better?
>
> Also, given Tom's apparent dislike for carrying that in PlannedStmt,
> maybe the way I have it now is fine?

I think if you change how it's indexed and the other stuff then we can
have another look.  I think the patch will be much easier to review
once the PartitionPruneInfos are moved into PlannedStmt.

David



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Apr 1, 2022 at 5:20 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Fri, 1 Apr 2022 at 19:58, Amit Langote <amitlangote09@gmail.com> wrote:
> > Yes, the ExecLockRelsInfo node in the current patch, that first gets
> > added to the QueryDesc and subsequently to the EState of the query,
> > serves as that stashing place.  Not sure if you've looked at
> > ExecLockRelsInfo in detail in your review of the patch so far, but it
> > carries the initial pruning result in what are called
> > PlanInitPruningOutput nodes, which are stored in a list in
> > ExecLockRelsInfo and their offsets in the list are in turn stored in
> > an adjacent array that contains an element for every plan node in the
> > tree.  If we go with a PlannedStmt.partpruneinfos list, then maybe we
> > don't need to have that array, because the Append/MergeAppend nodes
> > would be carrying those offsets by themselves.
>
> I saw it, just not in great detail. I saw that you had an array that
> was indexed by the plan node's ID.  I thought that wouldn't be so good
> with large complex plans that we often get with partitioning
> workloads.  That's why I mentioned using another index that you store
> in Append/MergeAppend that starts at 0 and increments by 1 for each
> node that has a PartitionPruneInfo made for it during create_plan.
>
> > Maybe a different name for ExecLockRelsInfo would be better?
> >
> > Also, given Tom's apparent dislike for carrying that in PlannedStmt,
> > maybe the way I have it now is fine?
>
> I think if you change how it's indexed and the other stuff then we can
> have another look.  I think the patch will be much easier to review
> once the PartitionPruneInfos are moved into PlannedStmt.

Will do, thanks.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Alvaro Herrera
Date:
I noticed a definitional problem in 0001 that's also a bug in some
conditions -- namely that the bitmapset "validplans" is never explicitly
initialized to NIL.  In the original coding, the BMS was always returned
from somewhere; in the new code, it is passed from an uninitialized
stack variable into the new ExecInitPartitionPruning function, which
then proceeds to add new members to it without initializing it first.
Indeed that function's header comment explicitly indicates that it is
not initialized:

+ * Initial pruning can be done immediately, so it is done here if needed and
+ * the set of surviving partition subplans' indexes are added to the output
+ * parameter *initially_valid_subplans.

even though this is not fully correct, because when prunestate->do_initial_prune
is false, then the BMS *is* initialized.

I have no opinion on where to initialize it, but it needs to be done
somewhere and the comment needs to agree.


I think the names ExecCreatePartitionPruneState and
ExecInitPartitionPruning are too confusingly similar.  Maybe the former
should be renamed to somehow make it clear that it is a subroutine for
the latter.


At the top of the file, there's a new comment that reads:

  * ExecInitPartitionPruning:
  *     Creates the PartitionPruneState required by each of the two pruning
  *     functions.

What are "the two pruning functions"?  I think here you mean "Append"
and "MergeAppend".  Maybe spell that out explicitly.


I think this comment needs to be reworded:

+ * Subplans would previously be indexed 0..(n_total_subplans - 1) should be
+ * changed to index range 0..num(initially_valid_subplans).

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
Thanks for the review.

On Sun, Apr 3, 2022 at 8:33 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I noticed a definitional problem in 0001 that's also a bug in some
> conditions -- namely that the bitmapset "validplans" is never explicitly
> initialized to NIL.  In the original coding, the BMS was always returned
> from somewhere; in the new code, it is passed from an uninitialized
> stack variable into the new ExecInitPartitionPruning function, which
> then proceeds to add new members to it without initializing it first.

Hmm, the following blocks in ExecInitPartitionPruning() define
*initially_valid_subplans:

    /*
     * Perform an initial partition prune pass, if required.
     */
    if (prunestate->do_initial_prune)
    {
        /* Determine which subplans survive initial pruning */
        *initially_valid_subplans = ExecFindInitialMatchingSubPlans(prunestate);
    }
    else
    {
        /* We'll need to initialize all subplans */
        Assert(n_total_subplans > 0);
        *initially_valid_subplans = bms_add_range(NULL, 0,
                                                  n_total_subplans - 1);
    }

AFAICS, both assign *initially_valid_subplans a value whose
computation is not dependent on reading it first, so I don't see a
problem.

Am I missing something?

> Indeed that function's header comment explicitly indicates that it is
> not initialized:
>
> + * Initial pruning can be done immediately, so it is done here if needed and
> + * the set of surviving partition subplans' indexes are added to the output
> + * parameter *initially_valid_subplans.
>
> even though this is not fully correct, because when prunestate->do_initial_prune
> is false, then the BMS *is* initialized.
>
> I have no opinion on where to initialize it, but it needs to be done
> somewhere and the comment needs to agree.

I can see that the comment is insufficient, so I've expanded it as follows:

- * Initial pruning can be done immediately, so it is done here if needed and
- * the set of surviving partition subplans' indexes are added to the output
- * parameter *initially_valid_subplans.
+ * On return, *initially_valid_subplans is assigned the set of indexes of
+ * child subplans that must be initialized along with the parent plan node.
+ * Initial pruning is performed here if needed and in that case only the
+ * surviving subplans' indexes are added.

> I think the names ExecCreatePartitionPruneState and
> ExecInitPartitionPruning are too confusingly similar.  Maybe the former
> should be renamed to somehow make it clear that it is a subroutine for
> the latter.

Ah, yes.  I've taken out the "Exec" from the former.

> At the top of the file, there's a new comment that reads:
>
>   * ExecInitPartitionPruning:
>   *     Creates the PartitionPruneState required by each of the two pruning
>   *     functions.
>
> What are "the two pruning functions"?  I think here you mean "Append"
> and "MergeAppend".  Maybe spell that out explicitly.

Actually it meant: ExecFindInitialMatchingSubPlans() and
ExecFindMatchingSubPlans().  They perform "initial" and "exec" set of
pruning steps, respectively.

I realized that both functions have identical bodies at this point,
except that they pass 'true' and 'false', respectively, for the
initial_prune argument of the sub-routine
find_matching_subplans_recurse(), which is where the pruning using the
appropriate set of steps contained in PartitionPruneState
(initial_pruning_steps or exec_pruning_steps) actually occurs.  So,
I've updated the patch to just retain the latter, adding an
initial_prune parameter to it to pass to the aforementioned
find_matching_subplans_recurse().
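
So the surviving entry point ends up with roughly this shape (a sketch,
not the exact patch text):

    /*
     * Returns the indexes of the surviving subplans; 'initial_prune'
     * selects whether initial_pruning_steps or exec_pruning_steps are
     * the ones evaluated.
     */
    Bitmapset *
    ExecFindMatchingSubPlans(PartitionPruneState *prunestate,
                             bool initial_prune);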

I've also updated the run-time pruning module comment to describe this change:

  * ExecFindMatchingSubPlans:
- *     Returns indexes of matching subplans after evaluating all available
- *     expressions, that is, using execution pruning steps.  This function can
- *     can only be called during execution and must be called again each time
- *     the value of a Param listed in PartitionPruneState's 'execparamids'
- *     changes.
+ *     Returns indexes of matching subplans after evaluating the expressions
+ *     that are safe to evaluate at a given point.  This function is first
+ *     called during ExecInitPartitionPruning() to find the initially
+ *     matching subplans based on performing the initial pruning steps and
+ *     then must be called again each time the value of a Param listed in
+ *     PartitionPruneState's 'execparamids' changes.

> I think this comment needs to be reworded:
>
> + * Subplans would previously be indexed 0..(n_total_subplans - 1) should be
> + * changed to index range 0..num(initially_valid_subplans).

Assuming you meant to ask to write this without the odd notation, I've
expanded the comment as follows:

- * Subplans would previously be indexed 0..(n_total_subplans - 1) should be
- * changed to index range 0..num(initially_valid_subplans).
+ * Current values of the indexes present in PartitionPruneState count all the
+ * subplans that would be present before initial pruning was done.  If initial
+ * pruning got rid of some of the subplans, any subsequent pruning passes will
+ * be looking at a different set of target subplans to choose from than
+ * those in the pre-initial-pruning set, so the maps in PartitionPruneState
+ * containing those indexes must be updated to reflect the new indexes of
+ * subplans in the post-initial-pruning set.

I've attached only the updated 0001, though I'm still working on the
others to address David's comments.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Mon, Apr 4, 2022 at 9:55 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Sun, Apr 3, 2022 at 8:33 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > I think the names ExecCreatePartitionPruneState and
> > ExecInitPartitionPruning are too confusingly similar.  Maybe the former
> > should be renamed to somehow make it clear that it is a subroutine for
> > the latter.
>
> Ah, yes.  I've taken out the "Exec" from the former.

While at it, maybe it's better to rename ExecInitPruningContext() to
InitPartitionPruneContext(), which I've done in the attached updated
patch.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Alvaro Herrera
Date:
On 2022-Apr-05, Amit Langote wrote:

> While at it, maybe it's better to rename ExecInitPruningContext() to
> InitPartitionPruneContext(), which I've done in the attached updated
> patch.

Good call.  I had changed that name too, but yours seems a better
choice.

I made a few other cosmetic changes and pushed.  I'm afraid this will
cause a few conflicts with your 0004 -- hopefully these should mostly be
minor.

One change that's not completely cosmetic is a change in the test on
whether to call PartitionPruneFixSubPlanMap or not.  Originally it was:

if (partprune->do_exec_prune &&
    bms_num_members( ... ))
        do_stuff();

which meant that bms_num_members() is only evaluated if do_exec_prune.
However, the do_exec_prune bit is an optimization (we can skip doing
that stuff if it's not going to be used), but the other test is more
strict: the stuff is completely irrelevant if no plans have been
removed, since the data structure does not need fixing.  So I changed it
to be like this

if (bms_num_members( .. ))
{
    /* can skip if it's pointless */
    if (do_exec_prune)
        do_stuff();
}

I think that it is clearer to the human reader this way; and I think a
smart compiler may realize that the test can be reversed and avoid
counting bits when it's pointless.

So your 0004 patch should add the new condition to the outer if(), since
it's a critical consideration rather than an optimization:
if (partprune && bms_num_members())
{
    /* can skip if pointless */
    if (do_exec_prune)
        do_stuff()
}

Now, if we disagree and think that counting bits in the BMS is wasteful
when the result is going to be discarded because do_exec_prune is false,
then we can flip that back to how it was originally and add a more
explicit comment.  With no evidence, I doubt it matters.

Thanks for the patch!  I think the new coding is indeed a bit easier to
follow.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
<inflex> really, I see PHP as like a strange amalgamation of C, Perl, Shell
<crab> inflex: you know that "amalgam" means "mixture with mercury",
       more or less, right?
<crab> i.e., "deadly poison"



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Apr 5, 2022 at 7:00 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Apr-05, Amit Langote wrote:
> > While at it, maybe it's better to rename ExecInitPruningContext() to
> > InitPartitionPruneContext(), which I've done in the attached updated
> > patch.
>
> Good call.  I had changed that name too, but yours seems a better
> choice.
>
> I made a few other cosmetic changes and pushed.

Thanks!

>  I'm afraid this will
> cause a few conflicts with your 0004 -- hopefully these should mostly be
> minor.
>
> One change that's not completely cosmetic is a change in the test on
> whether to call PartitionPruneFixSubPlanMap or not.  Originally it was:
>
> if (partprune->do_exec_prune &&
>     bms_num_members( ... ))
>         do_stuff();
>
> which meant that bms_num_members() is only evaluated if do_exec_prune.
> However, the do_exec_prune bit is an optimization (we can skip doing
> that stuff if it's not going to be used), but the other test is more
> strict: the stuff is completely irrelevant if no plans have been
> removed, since the data structure does not need fixing.  So I changed it
> to be like this
>
> if (bms_num_members( .. ))
> {
>         /* can skip if it's pointless */
>         if (do_exec_prune)
>                 do_stuff();
> }
>
> I think that it is clearer to the human reader this way; and I think a
> smart compiler may realize that the test can be reversed and avoid
> counting bits when it's pointless.
>
> So your 0004 patch should add the new condition to the outer if(), since
> it's a critical consideration rather than an optimization:
> if (partprune && bms_num_members())
> {
>         /* can skip if pointless */
>         if (do_exec_prune)
>                 do_stuff()
> }
>
> Now, if we disagree and think that counting bits in the BMS is wasteful
> when the result is going to be discarded because do_exec_prune is false,
> then we can flip that back to how it was originally and add a more
> explicit comment.  With no evidence, I doubt it matters.

I agree that counting bits in the outer condition makes this easier to
read, so I see no problem with keeping it that way.

Will post the rebased main patch soon, whose rewrite I'm close to
being done with.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Apr 1, 2022 at 5:36 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Fri, Apr 1, 2022 at 5:20 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > On Fri, 1 Apr 2022 at 19:58, Amit Langote <amitlangote09@gmail.com> wrote:
> > > Yes, the ExecLockRelsInfo node in the current patch, that first gets
> > > added to the QueryDesc and subsequently to the EState of the query,
> > > serves as that stashing place.  Not sure if you've looked at
> > > ExecLockRelsInfo in detail in your review of the patch so far, but it
> > > carries the initial pruning result in what are called
> > > PlanInitPruningOutput nodes, which are stored in a list in
> > > ExecLockRelsInfo and their offsets in the list are in turn stored in
> > > an adjacent array that contains an element for every plan node in the
> > > tree.  If we go with a PlannedStmt.partpruneinfos list, then maybe we
> > > don't need to have that array, because the Append/MergeAppend nodes
> > > would be carrying those offsets by themselves.
> >
> > I saw it, just not in great detail. I saw that you had an array that
> > was indexed by the plan node's ID.  I thought that wouldn't be so good
> > with large complex plans that we often get with partitioning
> > workloads.  That's why I mentioned using another index that you store
> > in Append/MergeAppend that starts at 0 and increments by 1 for each
> > node that has a PartitionPruneInfo made for it during create_plan.
> >
> > > Maybe a different name for ExecLockRelsInfo would be better?
> > >
> > > Also, given Tom's apparent dislike for carrying that in PlannedStmt,
> > > maybe the way I have it now is fine?
> >
> > I think if you change how it's indexed and the other stuff then we can
> > have another look.  I think the patch will be much easier to review
> > once the PartitionPruneInfos are moved into PlannedStmt.
>
> Will do, thanks.

And here is a version like that that passes make check-world.  Maybe
still a WIP as I think comments could use more editing.

Here's how the new implementation works:

AcquireExecutorLocks() calls ExecutorDoInitialPruning(), which in turn
iterates over a list of PartitionPruneInfos in a given PlannedStmt
coming from a CachedPlan.  For each PartitionPruneInfo,
ExecPartitionDoInitialPruning() is called, which sets up
PartitionPruneState and performs initial pruning steps present in the
PartitionPruneInfo.  The resulting bitmapsets of valid subplans, one
for each PartitionPruneInfo, are collected in a list and added to a
result node called PartitionPruneResult.  It represents the result of
performing initial pruning on all PartitionPruneInfos found in a plan.
A list of PartitionPruneResults is passed along with the PlannedStmt
to the executor, which is referenced when initializing
Append/MergeAppend nodes.

PlannedStmt.minLockRelids defined by the planner contains the RT
indexes of all the entries in the range table minus those of the leaf
partitions whose subplans are subject to removal due to initial
pruning.  AcquireExecutorLocks() adds back the RT indexes of only those
leaf partitions whose subplans survive ExecutorDoInitialPruning().  To
get the leaf partition RT indexes from the PartitionPruneInfo, a new
rti_map array is added to PartitionedRelPruneInfo.
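
Put together, the AcquireExecutorLocks() side then looks roughly like
the following for a cached PlannedStmt (a condensed sketch; the exact
signature of ExecutorDoInitialPruning() shown here is illustrative):

    /* Start from the minimal lock set computed by the planner */
    Bitmapset  *lockrelids = bms_copy(plannedstmt->minLockRelids);
    int         rti = -1;

    /*
     * Perform initial pruning; the RT indexes of surviving leaf partitions
     * (looked up via each PartitionedRelPruneInfo's rti_map) are added back
     * into lockrelids, and the per-PartitionPruneInfo results are returned
     * for later reuse by ExecInit[Merge]Append().
     */
    pruneresults = ExecutorDoInitialPruning(plannedstmt, boundParams,
                                            &lockrelids);

    /* Lock whatever remains */
    while ((rti = bms_next_member(lockrelids, rti)) >= 0)
    {
        RangeTblEntry *rte = rt_fetch(rti, plannedstmt->rtable);

        if (rte->rtekind == RTE_RELATION)
            LockRelationOid(rte->relid, rte->rellockmode);
    }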

There's only one patch this time.  Patches that added partitioned_rels
and plan_tree_walker() are no longer necessary.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Wed, Apr 6, 2022 at 4:20 PM Amit Langote <amitlangote09@gmail.com> wrote:
> And here is a version like that that passes make check-world.  Maybe
> still a WIP as I think comments could use more editing.
>
> Here's how the new implementation works:
>
> AcquireExecutorLocks() calls ExecutorDoInitialPruning(), which in turn
> iterates over a list of PartitionPruneInfos in a given PlannedStmt
> coming from a CachedPlan.  For each PartitionPruneInfo,
> ExecPartitionDoInitialPruning() is called, which sets up
> PartitionPruneState and performs initial pruning steps present in the
> PartitionPruneInfo.  The resulting bitmapsets of valid subplans, one
> for each PartitionPruneInfo, are collected in a list and added to a
> result node called PartitionPruneResult.  It represents the result of
> performing initial pruning on all PartitionPruneInfos found in a plan.
> A list of PartitionPruneResults is passed along with the PlannedStmt
> to the executor, which is referenced when initializing
> Append/MergeAppend nodes.
>
> PlannedStmt.minLockRelids defined by the planner contains the RT
> indexes of all the entries in the range table minus those of the leaf
> partitions whose subplans are subject to removal due to initial
> pruning.  AcquireExecutorLocks() adds back the RT indexes of only those
> leaf partitions whose subplans survive ExecutorDoInitialPruning().  To
> get the leaf partition RT indexes from the PartitionPruneInfo, a new
> rti_map array is added to PartitionedRelPruneInfo.
>
> There's only one patch this time.  Patches that added partitioned_rels
> and plan_tree_walker() are no longer necessary.

Here's an updated version.  In particular, I removed the
part_prune_results list from PortalData, in favor of having anything
that needs to look at the list get it from the CachedPlan
(PortalData.cplan).  This makes things better in 2 ways:

* All the changes that were needed to produce the list to be passed to
PortalDefineQuery() are now unnecessary (the especially ugly ones were
those made to pg_plan_queries()'s interface)

* The cases in which the PartitionPruneResult being added to a
QueryDesc can be assumed to be valid are more clearly defined now; they
are the cases where the portal's CachedPlan is also valid, that is,
where the accompanying PlannedStmt is a cached one.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
David Rowley
Date:
On Thu, 7 Apr 2022 at 20:28, Amit Langote <amitlangote09@gmail.com> wrote:
> Here's an updated version.  In particular, I removed the
> part_prune_results list from PortalData, in favor of having anything
> that needs to look at the list get it from the CachedPlan
> (PortalData.cplan).  This makes things better in 2 ways:

Thanks for making those changes.

I'm not overly familiar with the data structures we use for passing
plans around between the planner and executor, but storing the pruning
results in CachedPlan seems pretty bad. I see you've stashed them in
there and invented a new memory context to stop leaks into the cache
memory.

Since I'm not overly familiar with these structures, I'm trying to
imagine why you made that choice and the best I can come up with was
that it was the most convenient thing you had to hand inside
CheckCachedPlan().

I don't really have any great ideas right now on how to make this
better. I wonder if GetCachedPlan() should be changed to return some
struct that wraps up the CachedPlan with some sort of executor prep
info struct that we can stash the list of PartitionPruneResults in,
and perhaps something else one day.

Some lesser important stuff that I think could be done better.

* Are you also able to put meaningful comments on the
PartitionPruneResult struct in execnodes.h?

* In create_append_plan() and create_merge_append_plan() you have the
same code to set the part_prune_index. Why not just move all that code
into make_partition_pruneinfo() and have make_partition_pruneinfo()
return the index and append to the PlannerInfo.partPruneInfos List?

* Why not forboth() here?

i = 0;
foreach(stmtlist_item, portal->stmts)
{
PlannedStmt *pstmt = lfirst_node(PlannedStmt, stmtlist_item);
PartitionPruneResult *part_prune_result = part_prune_results ?
  list_nth(part_prune_results, i) :
  NULL;

i++;

* It would be good if ReleaseExecutorLocks() already knew the RTIs
that were locked. Maybe the executor prep info struct I mentioned
above could also store the RTIs that have been locked already and
allow ReleaseExecutorLocks() to just iterate over those to release the
locks.

David



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Apr 7, 2022 at 9:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Thu, 7 Apr 2022 at 20:28, Amit Langote <amitlangote09@gmail.com> wrote:
> > Here's an updated version.  In particular, I removed the
> > part_prune_results list from PortalData, in favor of having anything
> > that needs to look at the list get it from the CachedPlan
> > (PortalData.cplan).  This makes things better in 2 ways:
>
> Thanks for making those changes.
>
> I'm not overly familiar with the data structures we use for passing
> plans around between the planner and executor, but storing the pruning
> results in CachedPlan seems pretty bad. I see you've stashed them in
> there and invented a new memory context to stop leaks into the cache
> memory.
>
> Since I'm not overly familiar with these structures, I'm trying to
> imagine why you made that choice and the best I can come up with was
> that it was the most convenient thing you had to hand inside
> CheckCachedPlan().

Yeah, it's that way because it felt convenient, though I have wondered
if a simpler scheme that doesn't require any changes to the CachedPlan
data structure might be better after all.  Your pointing it out has
made me think a bit harder on that.

> I don't really have any great ideas right now on how to make this
> better. I wonder if GetCachedPlan() should be changed to return some
> struct that wraps up the CachedPlan with some sort of executor prep
> info struct that we can stash the list of PartitionPruneResults in,
> and perhaps something else one day.

I think what might be better to do now is just add an output List
parameter to GetCachedPlan() to which the PartitionPruneResult nodes
are added, instead of stashing them into CachedPlan as now.  IMHO, we should
leave inventing a new generic struct to the next project that will
make it necessary to return more information from GetCachedPlan() to
its users.  I find it hard to convincingly describe what the new
generic struct really is if we invent it *now*, when it's going to
carry a single list whose purpose is pretty narrow.

So, I've implemented this by making the callers of GetCachedPlan()
pass a list to add the PartitionPruneResults that may be produced.
Most callers can put that into the Portal for passing that to other
modules, so I have reinstated PortalData.part_prune_results.  As for
its memory management, the list and the PartitionPruneResults therein
will be allocated in a context that holds the Portal itself.

> Some lesser important stuff that I think could be done better.
>
> * Are you also able to put meaningful comments on the
> PartitionPruneResult struct in execnodes.h?
>
> * In create_append_plan() and create_merge_append_plan() you have the
> same code to set the part_prune_index. Why not just move all that code
> into make_partition_pruneinfo() and have make_partition_pruneinfo()
> return the index and append to the PlannerInfo.partPruneInfos List?

That sounds better, so done.

> * Why not forboth() here?
>
> i = 0;
> foreach(stmtlist_item, portal->stmts)
> {
> PlannedStmt *pstmt = lfirst_node(PlannedStmt, stmtlist_item);
> PartitionPruneResult *part_prune_result = part_prune_results ?
>   list_nth(part_prune_results, i) :
>   NULL;
>
> i++;

Because the PartitionPruneResult list may not always be available.  To
wit, it's only available when it is GetCachedPlan() that gave the
portal its plan.  I know this is a bit ugly, but it seems better than
fixing all users of Portal to build a dummy list, not that it is
totally avoidable even in the current implementation.

> * It would be good if ReleaseExecutorLocks() already knew the RTIs
> that were locked. Maybe the executor prep info struct I mentioned
> above could also store the RTIs that have been locked already and
> allow ReleaseExecutorLocks() to just iterate over those to release the
> locks.

Rewrote this such that ReleaseExecutorLocks() just receives a list of
per-PlannedStmt bitmapsets containing the RT indexes of only the
locked entries in that plan.
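
Roughly, that makes ReleaseExecutorLocks() a mirror image of the
acquisition side (a sketch; the list names are illustrative):

    forboth(lc1, stmt_list, lc2, lockedRelids_list)
    {
        PlannedStmt *plannedstmt = lfirst_node(PlannedStmt, lc1);
        Bitmapset   *lockedRelids = (Bitmapset *) lfirst(lc2);
        int          rti = -1;

        /* Unlock exactly the RTEs that AcquireExecutorLocks() locked */
        while ((rti = bms_next_member(lockedRelids, rti)) >= 0)
        {
            RangeTblEntry *rte = rt_fetch(rti, plannedstmt->rtable);

            UnlockRelationOid(rte->relid, rte->rellockmode);
        }
    }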

Attached updated patch with these changes.



--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
David Rowley
Date:
On Fri, 8 Apr 2022 at 17:49, Amit Langote <amitlangote09@gmail.com> wrote:
> Attached updated patch with these changes.

Thanks for making the changes.  I started looking over this patch but
really feel like it needs quite a few more iterations of what we've
just been doing to get it into proper committable shape. There seems
to be only about 40 mins to go before the freeze, so it seems very
unrealistic that it could be made to work.

I started trying to take a serious look at it this evening, but I feel
like I just failed to get into it deep enough to make any meaningful
improvements.  I'd need more time to study the problem before I could
build up a proper opinion on how exactly I think it should work.

Anyway. I've attached a small patch that's just a few things I
adjusted or questions while reading over your v13 patch.  Some of
these are just me questioning your code (See XXX comments) and some I
think are improvements. Feel free to take the hunks that you see fit
and drop anything you don't.

David

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
Hi David,

On Fri, Apr 8, 2022 at 8:16 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Fri, 8 Apr 2022 at 17:49, Amit Langote <amitlangote09@gmail.com> wrote:
> > Attached updated patch with these changes.
> Thanks for making the changes.  I started looking over this patch but
> really feel like it needs quite a few more iterations of what we've
> just been doing to get it into proper committable shape. There seems
> to be only about 40 mins to go before the freeze, so it seems very
> unrealistic that it could be made to work.

Yeah, totally understandable.

> I started trying to take a serious look at it this evening, but I feel
> like I just failed to get into it deep enough to make any meaningful
> improvements.  I'd need more time to study the problem before I could
> build up a proper opinion on how exactly I think it should work.
>
> Anyway. I've attached a small patch that's just a few things I
> adjusted or questions while reading over your v13 patch.  Some of
> these are just me questioning your code (See XXX comments) and some I
> think are improvements. Feel free to take the hunks that you see fit
> and drop anything you don't.

Thanks a lot for compiling those.

Most of the changes looked fine to me except a couple of typos, so I've
adopted those into the attached new version, even though I know it's
too late to try to apply it.  Re the XXX comments:

+ /* XXX why would pprune->rti_map[i] ever be zero here??? */

Yeah, no, there can't be; I was perhaps being overly paranoid.

+ * XXX is it worth doing a bms_copy() on glob->minLockRelids if
+ * glob->containsInitialPruning is true?  I'm slightly worried that the
+ * Bitmapset could have a very long empty tail resulting in excessive
+ * looping during AcquireExecutorLocks().
+ */

I guess I trust your instincts about bitmapset operation efficiency
and what you've written here makes sense.  It's typical for leaf
partitions to have been appended toward the tail end of rtable and I'd
imagine their indexes would be in the tail words of minLockRelids.  If
copying the bitmapset removes those useless words, I don't see why we
shouldn't do that.  So added:

+ /*
+ * It seems worth doing a bms_copy() on glob->minLockRelids if we deleted
+ * bit from it just above to prevent empty tail bits resulting in
+ * inefficient looping during AcquireExecutorLocks().
+ */
+ if (glob->containsInitialPruning)
+ glob->minLockRelids = bms_copy(glob->minLockRelids)

Not 100% sure about the comment I wrote.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Apr 8, 2022 at 8:45 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Most of the changes looked fine to me except a couple of typos, so I've
> adopted those into the attached new version, even though I know it's
> too late to try to apply it.
>
> + * XXX is it worth doing a bms_copy() on glob->minLockRelids if
> + * glob->containsInitialPruning is true?  I'm slightly worried that the
> + * Bitmapset could have a very long empty tail resulting in excessive
> + * looping during AcquireExecutorLocks().
> + */
>
> I guess I trust your instincts about bitmapset operation efficiency
> and what you've written here makes sense.  It's typical for leaf
> partitions to have been appended toward the tail end of rtable and I'd
> imagine their indexes would be in the tail words of minLockRelids.  If
> copying the bitmapset removes those useless words, I don't see why we
> shouldn't do that.  So added:
>
> + /*
> + * It seems worth doing a bms_copy() on glob->minLockRelids if we deleted
> + * bit from it just above to prevent empty tail bits resulting in
> + * inefficient looping during AcquireExecutorLocks().
> + */
> + if (glob->containsInitialPruning)
> + glob->minLockRelids = bms_copy(glob->minLockRelids)
>
> Not 100% sure about the comment I wrote.

And the quoted code change missed a semicolon in the v14 that I
hurriedly sent on Friday.   (Had apparently forgotten to `git add` the
hunk to fix that).

Sending v15 that fixes that to keep the cfbot green for now.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Zhihong Yu
Date:


On Sun, Apr 10, 2022 at 8:05 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Fri, Apr 8, 2022 at 8:45 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Most of the changes looked fine to me except a couple of typos, so I've
> adopted those into the attached new version, even though I know it's
> too late to try to apply it.
>
> + * XXX is it worth doing a bms_copy() on glob->minLockRelids if
> + * glob->containsInitialPruning is true?  I'm slightly worried that the
> + * Bitmapset could have a very long empty tail resulting in excessive
> + * looping during AcquireExecutorLocks().
> + */
>
> I guess I trust your instincts about bitmapset operation efficiency
> and what you've written here makes sense.  It's typical for leaf
> partitions to have been appended toward the tail end of rtable and I'd
> imagine their indexes would be in the tail words of minLockRelids.  If
> copying the bitmapset removes those useless words, I don't see why we
> shouldn't do that.  So added:
>
> + /*
> + * It seems worth doing a bms_copy() on glob->minLockRelids if we deleted
> + * bit from it just above to prevent empty tail bits resulting in
> + * inefficient looping during AcquireExecutorLocks().
> + */
> + if (glob->containsInitialPruning)
> + glob->minLockRelids = bms_copy(glob->minLockRelids)
>
> Not 100% sure about the comment I wrote.

And the quoted code change missed a semicolon in the v14 that I
hurriedly sent on Friday.   (Had apparently forgotten to `git add` the
hunk to fix that).

Sending v15 that fixes that to keep the cfbot green for now.

--
Amit Langote
EDB: http://www.enterprisedb.com
Hi,

+               /* RT index of the partitione table. */

partitione -> partitioned

Cheers

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Mon, Apr 11, 2022 at 12:53 PM Zhihong Yu <zyu@yugabyte.com> wrote:
> On Sun, Apr 10, 2022 at 8:05 PM Amit Langote <amitlangote09@gmail.com> wrote:
>> Sending v15 that fixes that to keep the cfbot green for now.
>
> Hi,
>
> +               /* RT index of the partitione table. */
>
> partitione -> partitioned

Thanks, fixed.

Also, I broke this into patches:

0001 contains the mechanical changes of moving PartitionPruneInfo out
of Append/MergeAppend into a list in PlannedStmt.

0002 is the main patch to "Optimize AcquireExecutorLocks() by locking
only unpruned partitions".

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Zhihong Yu
Date:


On Fri, May 27, 2022 at 1:10 AM Amit Langote <amitlangote09@gmail.com> wrote:
On Mon, Apr 11, 2022 at 12:53 PM Zhihong Yu <zyu@yugabyte.com> wrote:
> On Sun, Apr 10, 2022 at 8:05 PM Amit Langote <amitlangote09@gmail.com> wrote:
>> Sending v15 that fixes that to keep the cfbot green for now.
>
> Hi,
>
> +               /* RT index of the partitione table. */
>
> partitione -> partitioned

Thanks, fixed.

Also, I broke this into patches:

0001 contains the mechanical changes of moving PartitionPruneInfo out
of Append/MergeAppend into a list in PlannedStmt.

0002 is the main patch to "Optimize AcquireExecutorLocks() by locking
only unpruned partitions".

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
Hi,
In the description:

is made available to the actual execution via
PartitionPruneResult, made available along with the PlannedStmt by the 

I think the second `made available` is redundant (can be omitted).

+ * Initial pruning is performed here if needed (unless it has already been done
+ * by ExecDoInitialPruning()), and in that case only the surviving subplans'

I wonder if there is a typo above - I can't find ExecDoInitialPruning either in the PG codebase or in the patches (except for this occurrence in the comment).
I think it should be ExecutorDoInitialPruning.

+    * bit from it just above to prevent empty tail bits resulting in

I searched the code base but didn't find any mention of `empty tail bit`. Do you mind explaining a bit about it?

Cheers

Re: generic plans and "initial" pruning

From
Jacob Champion
Date:
On Fri, May 27, 2022 at 1:09 AM Amit Langote <amitlangote09@gmail.com> wrote:
> 0001 contains the mechanical changes of moving PartitionPruneInfo out
> of Append/MergeAppend into a list in PlannedStmt.
>
> 0002 is the main patch to "Optimize AcquireExecutorLocks() by locking
> only unpruned partitions".

This patchset will need to be rebased over 835d476fd21; looks like
just a cosmetic change.

--Jacob



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Wed, Jul 6, 2022 at 2:43 AM Jacob Champion <jchampion@timescale.com> wrote:
> On Fri, May 27, 2022 at 1:09 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > 0001 contains the mechanical changes of moving PartitionPruneInfo out
> > of Append/MergeAppend into a list in PlannedStmt.
> >
> > 0002 is the main patch to "Optimize AcquireExecutorLocks() by locking
> > only unpruned partitions".
>
> This patchset will need to be rebased over 835d476fd21; looks like
> just a cosmetic change.

Thanks for the heads up.

Rebased and also fixed per comments given by Zhihong Yu on May 28.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Wed, Jul 13, 2022 at 3:40 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Rebased over 964d01ae90c.

Sorry, left some pointless hunks in there while rebasing.  Fixed in
the attached.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Wed, Jul 13, 2022 at 4:03 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Jul 13, 2022 at 3:40 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > Rebased over 964d01ae90c.
>
> Sorry, left some pointless hunks in there while rebasing.  Fixed in
> the attached.

Needed to be rebased again, over 2d04277121f this time.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Tue, Jul 26, 2022 at 11:01 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Needed to be rebased again, over 2d04277121f this time.

0001 adds es_part_prune_result but does not use it, so maybe the
introduction of that field should be deferred until it's needed for
something.

I wonder whether it's really necessary to add the PartitionPruneInfo
objects to a list in PlannerInfo first and then roll them up into
PlannerGlobal later. I know we do that for range table entries, but
I've never quite understood why we do it that way instead of creating
a flat range table in PlannerGlobal from the start. And so by
extension I wonder whether this table couldn't be flat from the start
also.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jul 26, 2022 at 11:01 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > Needed to be rebased again, over 2d04277121f this time.

Thanks for looking.

> 0001 adds es_part_prune_result but does not use it, so maybe the
> introduction of that field should be deferred until it's needed for
> something.

Oops, looks like a mistake when breaking the patch.  Will move that bit to 0002.

> I wonder whether it's really necessary to add the PartitionPruneInfo
> objects to a list in PlannerInfo first and then roll them up into
> PlannerGlobal later. I know we do that for range table entries, but
> I've never quite understood why we do it that way instead of creating
> a flat range table in PlannerGlobal from the start. And so by
> extension I wonder whether this table couldn't be flat from the start
> also.

Tom may want to correct me but my understanding of why the planner
waits till the end of planning to start populating the PlannerGlobal
range table is that it is not until then that we know which subqueries
will be scanned by the final plan tree, and hence whose range table
entries will be included in the range table passed to the executor.  I
can see that subquery pull-up causes a pulled-up subquery's range
table entries to be added into the parent's query's and all its nodes
changed using OffsetVarNodes() to refer to the new RT indexes.  But
for subqueries that are not pulled up, their subplans' nodes (present
in PlannerGlboal.subplans) would still refer to the original RT
indexes (per range table in the corresponding PlannerGlobal.subroot),
which must be fixed and the end of planning is the time to do so.  Or
maybe that could be done when build_subplan() creates a subplan and
adds it to PlannerGlobal.subplans, but for some reason it's not?

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
> On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> I wonder whether it's really necessary to added the PartitionPruneInfo
>> objects to a list in PlannerInfo first and then roll them up into
>> PlannerGlobal later. I know we do that for range table entries, but
>> I've never quite understood why we do it that way instead of creating
>> a flat range table in PlannerGlobal from the start. And so by
>> extension I wonder whether this table couldn't be flat from the start
>> also.

> Tom may want to correct me but my understanding of why the planner
> waits till the end of planning to start populating the PlannerGlobal
> range table is that it is not until then that we know which subqueries
> will be scanned by the final plan tree, so also whose range table
> entries will be included in the range table passed to the executor.

It would not be profitable to flatten the range table before we've
done remove_useless_joins.  We'd end up with useless entries from
subqueries that ultimately aren't there.  We could perhaps do it
after we finish that phase, but I don't really see the point: it
wouldn't be better than what we do now, just the same work at a
different time.

            regards, tom lane



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Fri, Jul 29, 2022 at 12:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It would not be profitable to flatten the range table before we've
> done remove_useless_joins.  We'd end up with useless entries from
> subqueries that ultimately aren't there.  We could perhaps do it
> after we finish that phase, but I don't really see the point: it
> wouldn't be better than what we do now, just the same work at a
> different time.

That's not quite my question, though. Why do we ever build a non-flat
range table in the first place? Like, instead of assigning indexes
relative to the current subquery level, why not just assign them
relative to the whole query from the start? It can't really be that
we've done it this way because of remove_useless_joins(), because
we've been building separate range tables and later flattening them
for longer than join removal has existed as a feature.

What bugs me is that it's very much not free. By building a bunch of
separate range tables and combining them later, we generate extra
work: we have to go back and adjust RT indexes after-the-fact. We pay
that overhead for every query, not just the ones that end up with some
unused entries in the range table. And why would it matter if we did
end up with some useless entries in the range table, anyway? If
there's some semantic difference, we could add a flag to mark those
entries as needing to be ignored, which seems way better than crawling
all over the whole tree adjusting RTIs everywhere.

I don't really expect that we're ever going to change this -- and
certainly not on this thread. The idea of running around and replacing
RT indexes all over the tree is deeply embedded in the system. But are
we really sure we want to add a second kind of index that we have to
run around and adjust at the same time?

If we are, so be it, I guess. It just looks really ugly and unnecessary to me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> That's not quite my question, though. Why do we ever build a non-flat
> range table in the first place? Like, instead of assigning indexes
> relative to the current subquery level, why not just assign them
> relative to the whole query from the start?

We could probably make that work, but I'm skeptical that it would
really be an improvement overall, for a couple of reasons.

(1) The need for merge-rangetables-and-renumber-Vars logic doesn't
go away.  It just moves from setrefs.c to the rewriter, which would
have to do it when expanding views.  This would be a net loss
performance-wise, I think, because setrefs.c can do it as part of a
parsetree scan that it has to perform anyway for other housekeeping
reasons; but the rewriter would need a brand new pass over the tree.
Admittedly that pass would only happen for view replacement, but
it's still not open-and-shut that there'd be a performance win.

(2) The need for varlevelsup and similar fields doesn't go away,
I think, because we need those for semantic purposes such as
discovering the query level that aggregates are associated with.
That means that subquery flattening still has to make a pass over
the tree to touch every Var's varlevelsup; so not having to adjust
varno at the same time would save little.

I'm not sure whether I think it's a net plus or net minus that
varno would become effectively independent of varlevelsup.
It'd be different from the way we think of them now, for sure,
and I think it'd take awhile to flush out bugs arising from such
a redefinition.

> I don't really expect that we're ever going to change this -- and
> certainly not on this thread. The idea of running around and replacing
> RT indexes all over the tree is deeply embedded in the system. But are
> we really sure we want to add a second kind of index that we have to
> run around and adjust at the same time?

You probably want to avert your eyes from [1], then ;-).  Although
I'm far from convinced that the cross-list index fields currently
proposed there are actually necessary; the cost to adjust them
during rangetable merging could outweigh any benefit.

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/CA+HiwqGjJDmUhDSfv-U2qhKJjt9ST7Xh9JXC_irsAQ1TAUsJYg@mail.gmail.com



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Fri, Jul 29, 2022 at 11:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> We could probably make that work, but I'm skeptical that it would
> really be an improvement overall, for a couple of reasons.
>
> (1) The need for merge-rangetables-and-renumber-Vars logic doesn't
> go away.  It just moves from setrefs.c to the rewriter, which would
> have to do it when expanding views.  This would be a net loss
> performance-wise, I think, because setrefs.c can do it as part of a
> parsetree scan that it has to perform anyway for other housekeeping
> reasons; but the rewriter would need a brand new pass over the tree.
> Admittedly that pass would only happen for view replacement, but
> it's still not open-and-shut that there'd be a performance win.
>
> (2) The need for varlevelsup and similar fields doesn't go away,
> I think, because we need those for semantic purposes such as
> discovering the query level that aggregates are associated with.
> That means that subquery flattening still has to make a pass over
> the tree to touch every Var's varlevelsup; so not having to adjust
> varno at the same time would save little.
>
> I'm not sure whether I think it's a net plus or net minus that
> varno would become effectively independent of varlevelsup.
> It'd be different from the way we think of them now, for sure,
> and I think it'd take awhile to flush out bugs arising from such
> a redefinition.

Interesting. Thanks for your thoughts. I guess it's not as clear-cut
as I thought, but I still can't help feeling like we're doing an awful
lot of expensive rearrangement at the end of query planning.

I kind of wonder whether varlevelsup is the wrong idea. Like, suppose
we instead handed out subquery identifiers serially, sort of like what
we do with SubTransactionId values. Then instead of testing whether
varlevelsup>0 you test whether varsubqueryid==mysubqueryid. If you
flatten a query into its parent, you still need to adjust every var
that refers to the dead subquery, but you don't need to adjust vars
that refer to subqueries underneath it. Their level changes, but their
identity doesn't. Maybe that doesn't really help that much, but it's
always struck me as a little unfortunate that we basically test
whether a var is equal by testing whether the varno and varlevelsup
are equal. That only works if you assume that you can never end up
comparing two vars from thoroughly unrelated parts of the tree, such
that the subquery one level up from one might be different from the
subquery one level up from the other.
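
To make that a bit more concrete, here's a rough sketch with entirely
made-up names -- nothing below exists in the tree, it's only meant to
illustrate the shape of the test:

#include <stdbool.h>

/*
 * Hypothetical Var layout: the owning subquery is identified by a
 * serially-assigned id rather than by its nesting depth relative to the
 * referencing expression.
 */
typedef struct SketchVar
{
    int         varno;          /* index into the owning query's rtable */
    int         varattno;       /* attribute number within that relation */
    unsigned    varsubqueryid;  /* id of the owning subquery, handed out
                                 * serially, like SubTransactionIds */
} SketchVar;

static bool
sketch_vars_refer_to_same_column(const SketchVar *a, const SketchVar *b)
{
    /*
     * No varlevelsup comparison: flattening an intermediate subquery into
     * its parent changes the *level* of the subqueries below it but not
     * their identity, so comparisons like this stay valid without a
     * tree-wide fix-up.
     */
    return a->varsubqueryid == b->varsubqueryid &&
           a->varno == b->varno &&
           a->varattno == b->varattno;
}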

> > I don't really expect that we're ever going to change this -- and
> > certainly not on this thread. The idea of running around and replacing
> > RT indexes all over the tree is deeply embedded in the system. But are
> > we really sure we want to add a second kind of index that we have to
> > run around and adjust at the same time?
>
> You probably want to avert your eyes from [1], then ;-).  Although
> I'm far from convinced that the cross-list index fields currently
> proposed there are actually necessary; the cost to adjust them
> during rangetable merging could outweigh any benefit.

I really like the idea of that patch overall, actually; I think
permissions checking is a good example of something that shouldn't
require walking the whole query tree but currently does. And actually,
I think the same thing is true here: we shouldn't need to walk the
whole query tree to find the pruning information, but right now we do.
I'm just uncertain whether what Amit has implemented is the
least-annoying way to go about it... any thoughts on that,
specifically as it pertains to this patch?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> ... it's
> always struck me as a little unfortunate that we basically test
> whether a var is equal by testing whether the varno and varlevelsup
> are equal. That only works if you assume that you can never end up
> comparing two vars from thoroughly unrelated parts of the tree, such
> that the subquery one level up from one might be different from the
> subquery one level up from the other.

Yeah, that's always bothered me a little as well.  I've yet to see a
case where it causes a problem in practice.  But I think that if, say,
we were to try to do any sort of cross-query-level optimization, then
the ambiguity could rise up to bite us.  That might be a situation
where a flat rangetable would be worth the trouble.

> I'm just uncertain whether what Amit has implemented is the
> least-annoying way to go about it... any thoughts on that,
> specifically as it pertains to this patch?

I haven't looked at this patch at all.  I'll try to make some
time for it, but probably not today.

            regards, tom lane



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Fri, Jul 29, 2022 at 12:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'm just uncertain whether what Amit has implemented is the
> > least-annoying way to go about it... any thoughts on that,
> > specifically as it pertains to this patch?
>
> I haven't looked at this patch at all.  I'll try to make some
> time for it, but probably not today.

OK, thanks. The preliminary patch I'm talking about here is pretty
short, so you could probably look at that part of it, at least, in
some relatively small amount of time. And I think it's also in pretty
reasonable shape apart from this issue. But, as usual, there's the
question of how well one can evaluate a preliminary patch without
reviewing the full patch in detail.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Jul 29, 2022 at 1:20 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > 0001 adds es_part_prune_result but does not use it, so maybe the
> > introduction of that field should be deferred until it's needed for
> > something.
>
> Oops, looks like a mistake when breaking the patch.  Will move that bit to 0002.

Fixed that and also noticed that I had defined PartitionPruneResult in
the wrong header (execnodes.h).  That meant that PartitionPruneResult
nodes could not be written and read, because
src/backend/nodes/gen_node_support.pl doesn't create _out* and _read*
routines for the nodes defined in execnodes.h.  I moved its definition
to plannodes.h, even though it is not actually the planner that
instantiates those nodes; no other include/nodes header seemed a better
fit.

One more thing I realized is that Bitmapsets added to the List
PartitionPruneResult.valid_subplan_offs_list are not actually
read/write-able.  That's a problem that I also faced in [1], so I
proposed a patch there to make Bitmapset a read/write-able Node and
mark (only) the Bitmapsets that are added into read/write-able node
trees with the corresponding NodeTag.  I'm including that patch here
as well (0002) for the main patch to work (pass
-DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make sense
to discuss it in its own thread?

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CA%2BHiwqH80qX1ZLx3HyHmBrOzLQeuKuGx6FzGep0F_9zw9L4PAA%40mail.gmail.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Wed, Oct 12, 2022 at 4:36 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Fri, Jul 29, 2022 at 1:20 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > 0001 adds es_part_prune_result but does not use it, so maybe the
> > > introduction of that field should be deferred until it's needed for
> > > something.
> >
> > Oops, looks like a mistake when breaking the patch.  Will move that bit to 0002.
>
> Fixed that and also noticed that I had defined PartitionPruneResult in
> the wrong header (execnodes.h).  That led to PartitionPruneResult
> nodes not being able to be written and read, because
> src/backend/nodes/gen_node_support.pl doesn't create _out* and _read*
> routines for the nodes defined in execnodes.h.  I moved its definition
> to plannodes.h, even though it is not actually the planner that
> instantiates those; no other include/nodes header sounds better.
>
> One more thing I realized is that Bitmapsets added to the List
> PartitionPruneResult.valid_subplan_offs_list are not actually
> read/write-able.  That's a problem that I also faced in [1], so I
> proposed a patch there to make Bitmapset a read/write-able Node and
> mark (only) the Bitmapsets that are added into read/write-able node
> trees with the corresponding NodeTag.  I'm including that patch here
> as well (0002) for the main patch to work (pass
> -DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make sense
> to discuss it in its own thread?

Had second thoughts on the use of a List of Bitmapsets for this, with
the result that the make-Bitmapset-Nodes patch is no longer needed.

I had defined PartitionPruneResult such that it stood for the results
of pruning for all PartitionPruneInfos contained in
PlannedStmt.partPruneInfos (covering all Append/MergeAppend nodes that
can use partition pruning in a given plan).  So, it had a List of
Bitmapsets.  I think it's perhaps better for PartitionPruneResult to
cover only one PartitionPruneInfo and thus need only a Bitmapset and
not a List thereof, which I have implemented in the attached updated
patch 0002.  So, instead of passing around a single
PartitionPruneResult with each PlannedStmt, this now passes a List of
PartitionPruneResults, one entry for each entry in
PlannedStmt.partPruneInfos.
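
So each result node would end up looking roughly like this (the field
name is still tentative):

typedef struct PartitionPruneResult
{
    NodeTag     type;

    /*
     * Offsets of the subplans of one Append/MergeAppend that survived
     * "initial" pruning; one PartitionPruneResult per entry in
     * PlannedStmt.partPruneInfos.
     */
    Bitmapset  *valid_subplan_offs;
} PartitionPruneResult;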

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Mon, Oct 17, 2022 at 6:29 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Oct 12, 2022 at 4:36 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Fri, Jul 29, 2022 at 1:20 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > > 0001 adds es_part_prune_result but does not use it, so maybe the
> > > > introduction of that field should be deferred until it's needed for
> > > > something.
> > >
> > > Oops, looks like a mistake when breaking the patch.  Will move that bit to 0002.
> >
> > Fixed that and also noticed that I had defined PartitionPruneResult in
> > the wrong header (execnodes.h).  That led to PartitionPruneResult
> > nodes not being able to be written and read, because
> > src/backend/nodes/gen_node_support.pl doesn't create _out* and _read*
> > routines for the nodes defined in execnodes.h.  I moved its definition
> > to plannodes.h, even though it is not actually the planner that
> > instantiates those; no other include/nodes header sounds better.
> >
> > One more thing I realized is that Bitmapsets added to the List
> > PartitionPruneResult.valid_subplan_offs_list are not actually
> > read/write-able.  That's a problem that I also faced in [1], so I
> > proposed a patch there to make Bitmapset a read/write-able Node and
> > mark (only) the Bitmapsets that are added into read/write-able node
> > trees with the corresponding NodeTag.  I'm including that patch here
> > as well (0002) for the main patch to work (pass
> > -DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make sense
> > to discuss it in its own thread?
>
> Had second thoughts on the use of List of Bitmapsets for this, such
> that the make-Bitmapset-Nodes patch is no longer needed.
>
> I had defined PartitionPruneResult such that it stood for the results
> of pruning for all PartitionPruneInfos contained in
> PlannedStmt.partPruneInfos (covering all Append/MergeAppend nodes that
> can use partition pruning in a given plan).  So, it had a List of
> Bitmapset.  I think it's perhaps better for PartitionPruneResult to
> cover only one PartitionPruneInfo and thus need only a Bitmapset and
> not a List thereof, which I have implemented in the attached updated
> patch 0002.  So, instead of needing to pass around a
> PartitionPruneResult with each PlannedStmt, this now passes a List of
> PartitionPruneResult with an entry for each in
> PlannedStmt.partPruneInfos.

Rebased over 3b2db22fe.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Oct 27, 2022 at 11:41 AM Amit Langote <amitlangote09@gmail.com> wrote:
> On Mon, Oct 17, 2022 at 6:29 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Wed, Oct 12, 2022 at 4:36 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > On Fri, Jul 29, 2022 at 1:20 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > > > 0001 adds es_part_prune_result but does not use it, so maybe the
> > > > > introduction of that field should be deferred until it's needed for
> > > > > something.
> > > >
> > > > Oops, looks like a mistake when breaking the patch.  Will move that bit to 0002.
> > >
> > > Fixed that and also noticed that I had defined PartitionPruneResult in
> > > the wrong header (execnodes.h).  That led to PartitionPruneResult
> > > nodes not being able to be written and read, because
> > > src/backend/nodes/gen_node_support.pl doesn't create _out* and _read*
> > > routines for the nodes defined in execnodes.h.  I moved its definition
> > > to plannodes.h, even though it is not actually the planner that
> > > instantiates those; no other include/nodes header sounds better.
> > >
> > > One more thing I realized is that Bitmapsets added to the List
> > > PartitionPruneResult.valid_subplan_offs_list are not actually
> > > read/write-able.  That's a problem that I also faced in [1], so I
> > > proposed a patch there to make Bitmapset a read/write-able Node and
> > > mark (only) the Bitmapsets that are added into read/write-able node
> > > trees with the corresponding NodeTag.  I'm including that patch here
> > > as well (0002) for the main patch to work (pass
> > > -DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make sense
> > > to discuss it in its own thread?
> >
> > Had second thoughts on the use of List of Bitmapsets for this, such
> > that the make-Bitmapset-Nodes patch is no longer needed.
> >
> > I had defined PartitionPruneResult such that it stood for the results
> > of pruning for all PartitionPruneInfos contained in
> > PlannedStmt.partPruneInfos (covering all Append/MergeAppend nodes that
> > can use partition pruning in a given plan).  So, it had a List of
> > Bitmapset.  I think it's perhaps better for PartitionPruneResult to
> > cover only one PartitionPruneInfo and thus need only a Bitmapset and
> > not a List thereof, which I have implemented in the attached updated
> > patch 0002.  So, instead of needing to pass around a
> > PartitionPruneResult with each PlannedStmt, this now passes a List of
> > PartitionPruneResult with an entry for each in
> > PlannedStmt.partPruneInfos.
>
> Rebased over 3b2db22fe.

Updated 0002 to cope with AssertArg() being removed from the tree.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Alvaro Herrera
Date:
Looking at 0001, I wonder if we should have a crosscheck that a
PartitionPruneInfo you got from following an index is indeed constructed
for the relation that you think it is: previously, you were always sure
that the prune struct is for this node, because you followed a pointer
that was set up in the node itself.  Now you only have an index, and you
have to trust that the index is correct.

I'm not sure how to implement this, or even if it's doable at all.
Keeping the OID of the partitioned table in the PartitionPruneInfo
struct is easy, but I don't know how to check it in ExecInitMergeAppend
and ExecInitAppend.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Find a bug in a program, and fix it, and the program will work today.
Show the program how to find and fix a bug, and the program
will work forever" (Oliver Silfridge)



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
Hi Alvaro,

Thanks for looking at this one.

On Thu, Dec 1, 2022 at 3:12 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Looking at 0001, I wonder if we should have a crosscheck that a
> PartitionPruneInfo you got from following an index is indeed constructed
> for the relation that you think it is: previously, you were always sure
> that the prune struct is for this node, because you followed a pointer
> that was set up in the node itself.  Now you only have an index, and you
> have to trust that the index is correct.

Yeah, a crosscheck sounds like a good idea.

> I'm not sure how to implement this, or even if it's doable at all.
> Keeping the OID of the partitioned table in the PartitionPruneInfo
> struct is easy, but I don't know how to check it in ExecInitMergeAppend
> and ExecInitAppend.

Hmm, how about keeping the [Merge]Append's parent relation's RT index
in the PartitionPruneInfo and passing it down to
ExecInitPartitionPruning() from ExecInit[Merge]Append() for
cross-checking?  Both Append and MergeAppend already have a
'apprelids' field that we can save a copy of in the
PartitionPruneInfo.  Tried that in the attached delta patch.
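
The check itself amounts to something like this (a simplified sketch;
the field and variable names here are illustrative and may still
change):

/* In ExecInitPartitionPruning(), with apprelids passed down by the caller */
if (!bms_equal(pruneinfo->root_parent_relids, apprelids))
    elog(ERROR,
         "mismatching PartitionPruneInfo found at part_prune_index %d",
         part_prune_index);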

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Alvaro Herrera
Date:
On 2022-Dec-01, Amit Langote wrote:

> Hmm, how about keeping the [Merge]Append's parent relation's RT index
> in the PartitionPruneInfo and passing it down to
> ExecInitPartitionPruning() from ExecInit[Merge]Append() for
> cross-checking?  Both Append and MergeAppend already have a
> 'apprelids' field that we can save a copy of in the
> PartitionPruneInfo.  Tried that in the attached delta patch.

Ah yeah, that's about what I was thinking.  I've merged that in and
pushed it to GitHub, where CI had a strange pg_upgrade failure on
Windows mentioning log files that were not captured by the CI tooling.
So I pushed another commit trying to grab those files, in case it
wasn't a one-off failure.  It's running now:
  https://cirrus-ci.com/task/5857239638999040

If all goes well with this run, I'll get this 0001 pushed.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Investigación es lo que hago cuando no sé lo que estoy haciendo"
(Wernher von Braun)



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Dec 1, 2022 at 8:21 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Dec-01, Amit Langote wrote:
> > Hmm, how about keeping the [Merge]Append's parent relation's RT index
> > in the PartitionPruneInfo and passing it down to
> > ExecInitPartitionPruning() from ExecInit[Merge]Append() for
> > cross-checking?  Both Append and MergeAppend already have a
> > 'apprelids' field that we can save a copy of in the
> > PartitionPruneInfo.  Tried that in the attached delta patch.
>
> Ah yeah, that sounds about what I was thinking.  I've merged that in and
> pushed to github, which had a strange pg_upgrade failure on Windows
> mentioning log files that were not captured by the CI tooling.  So I
> pushed another one trying to grab those files, in case it wasn't an
> one-off failure.  It's running now:
>   https://cirrus-ci.com/task/5857239638999040
>
> If all goes well with this run, I'll get this 0001 pushed.

Thanks for pushing 0001.

Rebased 0002 attached.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Dec 1, 2022 at 9:43 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Thu, Dec 1, 2022 at 8:21 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > On 2022-Dec-01, Amit Langote wrote:
> > > Hmm, how about keeping the [Merge]Append's parent relation's RT index
> > > in the PartitionPruneInfo and passing it down to
> > > ExecInitPartitionPruning() from ExecInit[Merge]Append() for
> > > cross-checking?  Both Append and MergeAppend already have a
> > > 'apprelids' field that we can save a copy of in the
> > > PartitionPruneInfo.  Tried that in the attached delta patch.
> >
> > Ah yeah, that sounds about what I was thinking.  I've merged that in and
> > pushed to github, which had a strange pg_upgrade failure on Windows
> > mentioning log files that were not captured by the CI tooling.  So I
> > pushed another one trying to grab those files, in case it wasn't an
> > one-off failure.  It's running now:
> >   https://cirrus-ci.com/task/5857239638999040
> >
> > If all goes well with this run, I'll get this 0001 pushed.
>
> Thanks for pushing 0001.
>
> Rebased 0002 attached.

Thought it might be good for PartitionPruneResult to also have a
root_parent_relids field that matches the corresponding
PartitionPruneInfo's.  ExecInitPartitionPruning() now does a sanity
check that the root_parent_relids of a given pair of PartitionPrune{Info |
Result} match.

Posting the patch separately as the attached 0002, just in case you
might think that the extra cross-checking would be an overkill.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Dec 2, 2022 at 7:40 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Thu, Dec 1, 2022 at 9:43 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Thu, Dec 1, 2022 at 8:21 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > > On 2022-Dec-01, Amit Langote wrote:
> > > > Hmm, how about keeping the [Merge]Append's parent relation's RT index
> > > > in the PartitionPruneInfo and passing it down to
> > > > ExecInitPartitionPruning() from ExecInit[Merge]Append() for
> > > > cross-checking?  Both Append and MergeAppend already have a
> > > > 'apprelids' field that we can save a copy of in the
> > > > PartitionPruneInfo.  Tried that in the attached delta patch.
> > >
> > > Ah yeah, that sounds about what I was thinking.  I've merged that in and
> > > pushed to github, which had a strange pg_upgrade failure on Windows
> > > mentioning log files that were not captured by the CI tooling.  So I
> > > pushed another one trying to grab those files, in case it wasn't an
> > > one-off failure.  It's running now:
> > >   https://cirrus-ci.com/task/5857239638999040
> > >
> > > If all goes well with this run, I'll get this 0001 pushed.
> >
> > Thanks for pushing 0001.
> >
> > Rebased 0002 attached.
>
> Thought it might be good for PartitionPruneResult to also have
> root_parent_relids that matches with the corresponding
> PartitionPruneInfo.  ExecInitPartitionPruning() does a sanity check
> that the root_parent_relids of a given pair of PartitionPrune{Info |
> Result} match.
>
> Posting the patch separately as the attached 0002, just in case you
> might think that the extra cross-checking would be an overkill.

Rebased over 92c4dafe1eed and fixed some factual mistakes in the
comment above ExecutorDoInitialPruning().

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Mon, Dec 5, 2022 at 12:00 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Fri, Dec 2, 2022 at 7:40 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > Thought it might be good for PartitionPruneResult to also have
> > root_parent_relids that matches with the corresponding
> > PartitionPruneInfo.  ExecInitPartitionPruning() does a sanity check
> > that the root_parent_relids of a given pair of PartitionPrune{Info |
> > Result} match.
> >
> > Posting the patch separately as the attached 0002, just in case you
> > might think that the extra cross-checking would be an overkill.
>
> Rebased over 92c4dafe1eed and fixed some factual mistakes in the
> comment above ExecutorDoInitialPruning().

Sorry, I had forgotten to git-add hunks including some cosmetic
changes in that one.  Here's another version.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Alvaro Herrera
Date:
I find the API of GetCachedPlan a little weird after this patch.  I
think it may be better to have it return a pointer to a new struct --
one that contains both the CachedPlan pointer and the list of pruning
results.  (As I understand, the sole caller that isn't interested in the
pruning results, SPI_plan_get_cached_plan, can be explained by the fact
that it knows there won't be any.  So I don't think we need to worry
about this case?)

And I think you should make that struct also be the last argument of
PortalDefineQuery, so you don't need the separate
PortalStorePartitionPruneResults function -- because as far as I can
tell, the callers that pass a non-NULL pointer there are exactly the
same ones that later call PortalStorePartitionPruneResults.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
Thanks for the review.

On Wed, Dec 7, 2022 at 4:00 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I find the API of GetCachedPlans a little weird after this patch.  I
> think it may be better to have it return a pointer of a new struct --
> one that contains both the CachedPlan pointer and the list of pruning
> results.  (As I understand, the sole caller that isn't interested in the
> pruning results, SPI_plan_get_cached_plan, can be explained by the fact
> that it knows there won't be any.  So I don't think we need to worry
> about this case?)

David, in his Apr 7 reply on this thread, also seemed to suggest
something similar.

Hmm, I wasn't, and still am not, so sure that GetCachedPlan() should
return something that is not a CachedPlan.  An idea I had today was to
replace the part_prune_results_list output List parameter with, say, a
QueryInitPruningResult, or something like that, and put the current
list into that struct.  I was looking at QueryEnvironment when coming
up with *that* name.  Any thoughts?

> And I think you should make that struct also be the last argument of
> PortalDefineQuery, so you don't need the separate
> PortalStorePartitionPruneResults function -- because as far as I can
> tell, the callers that pass a non-NULL pointer there are the exactly
> same that later call PortalStorePartitionPruneResults.

Yes, it would be better to not need PortalStorePartitionPruneResults.


--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Alvaro Herrera
Date:
On 2022-Dec-09, Amit Langote wrote:

> On Wed, Dec 7, 2022 at 4:00 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > I find the API of GetCachedPlans a little weird after this patch.

> David, in his Apr 7 reply on this thread, also sounded to suggest
> something similar.
> 
> Hmm, I was / am not so sure if GetCachedPlan() should return something
> that is not CachedPlan.  An idea I had today was to replace the
> part_prune_results_list output List parameter with, say,
> QueryInitPruningResult, or something like that and put the current
> list into that struct.   Was looking at QueryEnvironment to come up
> with *that* name.  Any thoughts?

Remind me again why is part_prune_results_list not part of struct
CachedPlan then?  I tried to understand that based on comments upthread,
but I was unable to find anything.

(My first reaction to your above comment was "well, rename GetCachedPlan
then, maybe to GetRunnablePlan", but then I'm wondering if CachedPlan is
in any way a structure that must be "immutable" in the way parser output
is.  Looking at the comment at the top of plancache.c it appears to me
that it isn't, but maybe I'm missing something.)

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The Postgresql hackers have what I call a "NASA space shot" mentality.
 Quite refreshing in a world of "weekend drag racer" developers."
(Scott Marlowe)



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Dec 9, 2022 at 6:52 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Dec-09, Amit Langote wrote:
> > On Wed, Dec 7, 2022 at 4:00 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > > I find the API of GetCachedPlans a little weird after this patch.
>
> > David, in his Apr 7 reply on this thread, also sounded to suggest
> > something similar.
> >
> > Hmm, I was / am not so sure if GetCachedPlan() should return something
> > that is not CachedPlan.  An idea I had today was to replace the
> > part_prune_results_list output List parameter with, say,
> > QueryInitPruningResult, or something like that and put the current
> > list into that struct.   Was looking at QueryEnvironment to come up
> > with *that* name.  Any thoughts?
>
> Remind me again why is part_prune_results_list not part of struct
> CachedPlan then?  I tried to understand that based on comments upthread,
> but I was unable to find anything.

It used to be part of CachedPlan for a brief period of time (in patch
v12 I posted in [1]), but David, in his reply to [1], said he wasn't
so sure that it belonged there.

> (My first reaction to your above comment was "well, rename GetCachedPlan
> then, maybe to GetRunnablePlan", but then I'm wondering if CachedPlan is
> in any way a structure that must be "immutable" in the way parser output
> is.  Looking at the comment at the top of plancache.c it appears to me
> that it isn't, but maybe I'm missing something.)

CachedPlan *is* supposed to be read-only per the comment above
CachedPlanSource definition:

 * ...If we are using a generic
 * cached plan then it is meant to be re-used across multiple executions, so
 * callers must always treat CachedPlans as read-only.

FYI, there was even an idea of putting a PartitionPruneResults for a
given PlannedStmt into the PlannedStmt itself [2], but PlannedStmt is
supposed to be read-only too [3].

Maybe we need some new overarching context for invoking the plancache,
if Portal can't already be it, whose struct can be passed to
GetCachedPlan() to put the pruning results in?  Perhaps the
GetRunnablePlan() that you floated could be a wrapper around
GetCachedPlan(), owning that new context.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CA%2BHiwqH4qQ_YVROr7TY0jSCuGn0oHhH79_DswOdXWN5UnMCBtQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAApHDvp_DjVVkgSV24%2BUF7p_yKWeepgoo%2BW2SWLLhNmjwHTVYQ%40mail.gmail.com
[3] https://www.postgresql.org/message-id/922566.1648784745%40sss.pgh.pa.us



Re: generic plans and "initial" pruning

From
Alvaro Herrera
Date:
On 2022-Dec-09, Amit Langote wrote:

> On Fri, Dec 9, 2022 at 6:52 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > Remind me again why is part_prune_results_list not part of struct
> > CachedPlan then?  I tried to understand that based on comments upthread,
> > but I was unable to find anything.
> 
> It used to be part of CachedPlan for a brief period of time (in patch
> v12 I posted in [1]), but David, in his reply to [1], said he wasn't
> so sure that it belonged there.

I'm not sure I necessarily agree with that.  I'll have a look at v12 to
try and understand what David was so unhappy about.

> > (My first reaction to your above comment was "well, rename GetCachedPlan
> > then, maybe to GetRunnablePlan", but then I'm wondering if CachedPlan is
> > in any way a structure that must be "immutable" in the way parser output
> > is.  Looking at the comment at the top of plancache.c it appears to me
> > that it isn't, but maybe I'm missing something.)
> 
> CachedPlan *is* supposed to be read-only per the comment above
> CachedPlanSource definition:
> 
>  * ...If we are using a generic
>  * cached plan then it is meant to be re-used across multiple executions, so
>  * callers must always treat CachedPlans as read-only.

I read that as implying that the part_prune_results_list must remain
intact as long as no invalidations occur.  Does part_prune_results_list
really change as a result of something other than a sinval event?
Keep in mind that if a sinval message that touches one of the relations
in the plan arrives, then we'll discard the plan and generate it afresh.  I
don't see that the part_prune_results_list would change otherwise, but
maybe I misunderstand?

> FYI, there was even an idea of putting a PartitionPruneResults for a
> given PlannedStmt into the PlannedStmt itself [2], but PlannedStmt is
> supposed to be read-only too [3].

Hmm, I'm not familiar with PlannedStmt lifetime, but I'm definitely not
betting that Tom is wrong about this.

> Maybe we need some new overarching context when invoking plancache, if
> Portal can't already be it, whose struct can be passed to
> GetCachedPlan() to put the pruning results in?  Perhaps,
> GetRunnablePlan() that you floated could be a wrapper for
> GetCachedPlan(), owning that new context.

Perhaps that is a solution.  I'm not sure.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Uno puede defenderse de los ataques; contra los elogios se esta indefenso"



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Dec 9, 2022 at 7:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Dec-09, Amit Langote wrote:
> > On Fri, Dec 9, 2022 at 6:52 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > > Remind me again why is part_prune_results_list not part of struct
> > > CachedPlan then?  I tried to understand that based on comments upthread,
> > > but I was unable to find anything.
> >
> > > (My first reaction to your above comment was "well, rename GetCachedPlan
> > > then, maybe to GetRunnablePlan", but then I'm wondering if CachedPlan is
> > > in any way a structure that must be "immutable" in the way parser output
> > > is.  Looking at the comment at the top of plancache.c it appears to me
> > > that it isn't, but maybe I'm missing something.)
> >
> > CachedPlan *is* supposed to be read-only per the comment above
> > CachedPlanSource definition:
> >
> >  * ...If we are using a generic
> >  * cached plan then it is meant to be re-used across multiple executions, so
> >  * callers must always treat CachedPlans as read-only.
>
> I read that as implying that the part_prune_results_list must remain
> intact as long as no invalidations occur.  Does part_prune_result_list
> really change as a result of something other than a sinval event?
> Keep in mind that if a sinval message that touches one of the relations
> in the plan arrives, then we'll discard it and generate it afresh.  I
> don't see that the part_prune_results_list would change otherwise, but
> maybe I misunderstand?

Pruning will be done afresh on every fetch of a given cached plan, when
CheckCachedPlan() is called on it, so part_prune_results_list will be
discarded and rebuilt as many times as the plan is executed.  You'll
find a description of this around CachedPlanSavePartitionPruneResults()
in v12.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Alvaro Herrera
Date:
On 2022-Dec-09, Amit Langote wrote:

> Pruning will be done afresh on every fetch of a given cached plan when
> CheckCachedPlan() is called on it, so the part_prune_results_list part
> will be discarded and rebuilt as many times as the plan is executed.
> You'll find a description around CachedPlanSavePartitionPruneResults()
> that's in v12.

I see.

In that case, a separate container struct seems warranted.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees."                                      (E. Dijkstra)



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Dec 9, 2022 at 8:37 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Dec-09, Amit Langote wrote:
>
> > Pruning will be done afresh on every fetch of a given cached plan when
> > CheckCachedPlan() is called on it, so the part_prune_results_list part
> > will be discarded and rebuilt as many times as the plan is executed.
> > You'll find a description around CachedPlanSavePartitionPruneResults()
> > that's in v12.
>
> I see.
>
> In that case, a separate container struct seems warranted.

I thought about this today and played around with some container struct ideas.

Though, I started feeling that putting all the new logic added by this
patch into plancache.c, at the heart of GetCachedPlan(), and tweaking
its API in rather unintuitive ways may not have been such a good idea
to begin with.  So I started thinking again about your
GetRunnablePlan() wrapper idea and thought maybe we could do something
with it.  Let's say we name it GetCachedPlanLockPartitions() and put
the logic that does initial pruning with the new
ExecutorDoInitialPruning() in it, instead of in the normal
GetCachedPlan() path.  Any callers that currently call GetCachedPlan()
would instead call GetCachedPlanLockPartitions(), with either the
List ** parameter as now or some container struct if that seems better.
Whether GetCachedPlanLockPartitions() needs to do anything other than
return the CachedPlan returned by GetCachedPlan() can be decided by the
latter setting, say, CachedPlan.has_unlocked_partitions.  That would be
done by AcquireExecutorLocks() when it finds containsInitialPruning set
in any of the PlannedStmts it sees, in which case it locks only the
PlannedStmt.minLockRelids set (which is all relations where no pruning
is needed!), leaving the partition locking to
GetCachedPlanLockPartitions().  If the CachedPlan is invalidated during
the partition locking phase, GetCachedPlanLockPartitions() calls
GetCachedPlan() again; maybe some refactoring is needed to avoid too
much useless work in such cases.
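
In pseudocode-ish C, the flow I'm imagining is roughly the following;
every name here is tentative, and LockSurvivingPartitions() in
particular is just a stand-in for "lock whatever partitions survived
initial pruning":

CachedPlan *
GetCachedPlanLockPartitions(CachedPlanSource *plansource,
                            ParamListInfo boundParams,
                            ResourceOwner owner,
                            QueryEnvironment *queryEnv,
                            List **part_prune_results_list)
{
    for (;;)
    {
        CachedPlan *cplan = GetCachedPlan(plansource, boundParams,
                                          owner, queryEnv);

        /* AcquireExecutorLocks() only locked PlannedStmt.minLockRelids */
        if (!cplan->has_unlocked_partitions)
            return cplan;

        /*
         * Do initial pruning and lock only the surviving partitions; if
         * taking those locks invalidates the plan, start over.
         */
        *part_prune_results_list = ExecutorDoInitialPruning(cplan);
        if (LockSurvivingPartitions(cplan, *part_prune_results_list))
            return cplan;

        ReleaseCachedPlan(cplan, owner);
    }
}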

Thoughts?

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Alvaro Herrera
Date:
On 2022-Dec-12, Amit Langote wrote:

> I started feeling like putting all the new logic being added
> by this patch into plancache.c at the heart of GetCachedPlan() and
> tweaking its API in kind of unintuitive ways may not have been such a
> good idea to begin with.  So I started thinking again about your
> GetRunnablePlan() wrapper idea and thought maybe we could do something
> with it.  Let's say we name it GetCachedPlanLockPartitions() and put
> the logic that does initial pruning with the new
> ExecutorDoInitialPruning() in it, instead of in the normal
> GetCachedPlan() path.  Any callers that call GetCachedPlan() instead
> call GetCachedPlanLockPartitions() with either the List ** parameter
> as now or some container struct if that seems better.  Whether
> GetCachedPlanLockPartitions() needs to do anything other than return
> the CachedPlan returned by GetCachedPlan() can be decided by the
> latter setting, say, CachedPlan.has_unlocked_partitions.  That will be
> done by AcquireExecutorLocks() when it sees containsInitialPrunnig in
> any of the PlannedStmts it sees, locking only the
> PlannedStmt.minLockRelids set (which is all relations where no pruning
> is needed!), leaving the partition locking to
> GetCachedPlanLockPartitions().

Hmm.  This doesn't sound totally unreasonable, except for the point David
was making that perhaps we may want this container struct to accommodate
things other than just the partition pruning results in the future, so I
think its name (and that of the function that produces it) ought to be a
little more generic than that.

(I think this also answers your question on whether a List ** is better
than a container struct.)

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Las cosas son buenas o malas segun las hace nuestra opinión" (Lisias)



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Dec 13, 2022 at 2:24 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-Dec-12, Amit Langote wrote:
> > I started feeling like putting all the new logic being added
> > by this patch into plancache.c at the heart of GetCachedPlan() and
> > tweaking its API in kind of unintuitive ways may not have been such a
> > good idea to begin with.  So I started thinking again about your
> > GetRunnablePlan() wrapper idea and thought maybe we could do something
> > with it.  Let's say we name it GetCachedPlanLockPartitions() and put
> > the logic that does initial pruning with the new
> > ExecutorDoInitialPruning() in it, instead of in the normal
> > GetCachedPlan() path.  Any callers that call GetCachedPlan() instead
> > call GetCachedPlanLockPartitions() with either the List ** parameter
> > as now or some container struct if that seems better.  Whether
> > GetCachedPlanLockPartitions() needs to do anything other than return
> > the CachedPlan returned by GetCachedPlan() can be decided by the
> > latter setting, say, CachedPlan.has_unlocked_partitions.  That will be
> > done by AcquireExecutorLocks() when it sees containsInitialPrunnig in
> > any of the PlannedStmts it sees, locking only the
> > PlannedStmt.minLockRelids set (which is all relations where no pruning
> > is needed!), leaving the partition locking to
> > GetCachedPlanLockPartitions().
>
> Hmm.  This doesn't sound totally unreasonable, except to the point David
> was making that perhaps we may want this container struct to accomodate
> other things in the future than just the partition pruning results, so I
> think its name (and that of the function that produces it) ought to be a
> little more generic than that.
>
> (I think this also answers your question on whether a List ** is better
> than a container struct.)

OK, so here's a WIP attempt at that.

I have moved the original functionality of GetCachedPlan() to
GetCachedPlanInternal(), turning the former into a sort of controller,
as described shortly.  The latter's CheckCachedPlan() part now only
locks the "minimal" set of non-prunable relations, making a note of
whether the plan contains any prunable subnodes, and thus prunable
relations whose locking is deferred to the caller, GetCachedPlan().
GetCachedPlan(), acting as that controller, does the pruning if needed
on the minimally valid plan returned by GetCachedPlanInternal(), locks
the partitions that survive, and redoes the whole thing if locking the
partitions invalidates the plan.

The pruning results are returned through the new output parameter of
GetCachedPlan() of type CachedPlanExtra.  I named it so after much
consideration, because all the new logic that produces stuff to put
into it is a part of the plancache module and has to do with
manipulating a CachedPlan.  (I had considered CachedPlanExecInfo to
indicate that it contains information that is to be forwarded to the
executor, though that just didn't seem to fit in plancache.h.)
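
In other words, something of this minimal shape (the field name here is
approximate), which could grow more fields later without
GetCachedPlan()'s signature having to change again:

typedef struct CachedPlanExtra
{
    /* one PartitionPruneResult per PlannedStmt.partPruneInfos entry */
    List       *part_prune_results;
} CachedPlanExtra;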

I have broken out a few things into a preparatory patch 0001.  Mainly,
it invents PlannedStmt.minLockRelids to replace
AcquireExecutorLocks()'s current loop over the range table to figure
out the relations to lock.  I also threw in a couple of pruning-related
non-functional changes to make it easier to read 0002, which is the
main patch.



--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Wed, Dec 14, 2022 at 5:35 PM Amit Langote <amitlangote09@gmail.com> wrote:
> I have moved the original functionality of GetCachedPlan() to
> GetCachedPlanInternal(), turning the former into a sort of controller
> as described shortly.  The latter's CheckCachedPlan() part now only
> locks the "minimal" set of, non-prunable, relations, making a note of
> whether the plan contains any prunable subnodes and thus prunable
> relations whose locking is deferred to the caller, GetCachedPlan().
> GetCachedPlan(), as a sort of controller as mentioned before, does the
> pruning if needed on the minimally valid plan returned by
> GetCachedPlanInternal(), locks the partitions that survive, and redoes
> the whole thing if the locking of partitions invalidates the plan.

After sleeping on it, I realized this doesn't have to be that
complicated.   Rather than turn GetCachedPlan() into a wrapper for
handling deferred partition locking as outlined above, I could have
changed it more simply as follows to get the same thing done:

    if (!customplan)
    {
-       if (CheckCachedPlan(plansource))
+       bool    hasUnlockedParts = false;
+
+       if (CheckCachedPlan(plansource, &hasUnlockedParts) &&
+           hasUnlockedParts &&
+           CachedPlanLockPartitions(plansource, boundParams, owner, extra))
        {
            /* We want a generic plan, and we already have a valid one */
            plan = plansource->gplan;

Attached updated patch does it like that.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Alvaro Herrera
Date:
This version of the patch looks not entirely unreasonable to me.  I'll
set this as Ready for Committer in case David or Tom or someone else
want to have a look and potentially commit it.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Wed, Dec 21, 2022 at 7:18 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> This version of the patch looks not entirely unreasonable to me.  I'll
> set this as Ready for Committer in case David or Tom or someone else
> want to have a look and potentially commit it.

Thank you, Alvaro.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> This version of the patch looks not entirely unreasonable to me.  I'll
> set this as Ready for Committer in case David or Tom or someone else
> want to have a look and potentially commit it.

I will have a look during the January CF.

            regards, tom lane



Re: generic plans and "initial" pruning

From
Tom Lane
Date:
I spent some time re-reading this whole thread, and the more I read
the less happy I got.  We are adding a lot of complexity and introducing
coding hazards that will surely bite somebody someday.  And after a while
I had what felt like an epiphany: the whole problem arises because the
system is wrongly factored.  We should get rid of AcquireExecutorLocks
altogether, allowing the plancache to hand back a generic plan that
it's not certain of the validity of, and instead integrate the
responsibility for acquiring locks into executor startup.  It'd have
to be optional there, since we don't need new locks in the case of
executing a just-planned plan; but we can easily add another eflags
bit (EXEC_FLAG_GET_LOCKS or so).  Then there has to be a convention
whereby the ExecInitNode traversal can return an indicator that
"we failed because the plan is stale, please make a new plan".

There are a couple reasons why this feels like a good idea:

* There's no need for worry about keeping the locking decisions in sync
with what executor startup does.

* We don't need to add the overhead proposed in the current patch to
pass forward data about what got locked/pruned.  While that overhead
is hopefully less expensive than the locks it saved acquiring, it's
still overhead (and in some cases the patch will fail to save acquiring
any locks, making it certainly a net negative).

* In a successfully built execution state tree, there will simply
not be any nodes corresponding to pruned-away, never-locked subplans.
As long as code like EXPLAIN follows the state tree and doesn't poke
into plan nodes that have no matching state, it's secure against the
sort of problems that Robert worried about upthread.

While I've not attempted to write any code for this, I can also
think of a few issues that'd have to be resolved:

* We'd be pushing the responsibility for looping back and re-planning
out to fairly high-level calling code.  There are only half a dozen
callers of GetCachedPlan, so there's not that many places to be
touched; but in some of those places the subsequent executor-start call
is not close by, so that the necessary refactoring might be pretty
painful.  I doubt there's anything insurmountable, but we'd definitely
be changing some fundamental APIs.

* In some cases (views, at least) we need to acquire lock on relations
that aren't directly reflected anywhere in the plan tree.  So there'd
have to be a separate mechanism for getting those locks and rechecking
validity afterward.  A list of relevant relation OIDs might be enough
for that.

* We currently do ExecCheckPermissions() before initializing the
plan state tree.  It won't do to check permissions on relations we
haven't yet locked, so that responsibility would have to be moved.
Maybe that could also be integrated into the initialization recursion?
Not sure.

* In the existing usage of AcquireExecutorLocks, if we do decide
that the plan is stale then we are able to release all the locks
we got before we go off and replan.  I'm not certain if that behavior
needs to be preserved, but if it does then that would require some
additional bookkeeping in the executor.

* This approach is optimizing on the assumption that we usually
won't need to replan, because if we do then we might waste a fair
amount of executor startup overhead before discovering we have
to throw all that state away.  I think that's clearly the right
way to bet, but perhaps somebody else has a different view.

Thoughts?

            regards, tom lane



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Jan 20, 2023 at 4:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I spent some time re-reading this whole thread, and the more I read
> the less happy I got.

Thanks a lot for your time on this.

>  We are adding a lot of complexity and introducing
> coding hazards that will surely bite somebody someday.  And after awhile
> I had what felt like an epiphany: the whole problem arises because the
> system is wrongly factored.  We should get rid of AcquireExecutorLocks
> altogether, allowing the plancache to hand back a generic plan that
> it's not certain of the validity of, and instead integrate the
> responsibility for acquiring locks into executor startup.  It'd have
> to be optional there, since we don't need new locks in the case of
> executing a just-planned plan; but we can easily add another eflags
> bit (EXEC_FLAG_GET_LOCKS or so).  Then there has to be a convention
> whereby the ExecInitNode traversal can return an indicator that
> "we failed because the plan is stale, please make a new plan".

Interesting.  The current implementation relies on
PlanCacheRelCallback() marking a generic CachedPlan as invalid, so
perhaps there will have to be some sharing of state between the
plancache and the executor for this to work?

> There are a couple reasons why this feels like a good idea:
>
> * There's no need for worry about keeping the locking decisions in sync
> with what executor startup does.
>
> * We don't need to add the overhead proposed in the current patch to
> pass forward data about what got locked/pruned.  While that overhead
> is hopefully less expensive than the locks it saved acquiring, it's
> still overhead (and in some cases the patch will fail to save acquiring
> any locks, making it certainly a net negative).
>
> * In a successfully built execution state tree, there will simply
> not be any nodes corresponding to pruned-away, never-locked subplans.
> As long as code like EXPLAIN follows the state tree and doesn't poke
> into plan nodes that have no matching state, it's secure against the
> sort of problems that Robert worried about upthread.

I think this is true with the patch as proposed too, but I was still a
bit worried about what an ExecutorStart_hook may be doing with an
uninitialized plan tree.  Maybe we're mandating that the hook must
call standard_ExecutorStart() and only work with the finished
PlanState tree?

> While I've not attempted to write any code for this, I can also
> think of a few issues that'd have to be resolved:
>
> * We'd be pushing the responsibility for looping back and re-planning
> out to fairly high-level calling code.  There are only half a dozen
> callers of GetCachedPlan, so there's not that many places to be
> touched; but in some of those places the subsequent executor-start call
> is not close by, so that the necessary refactoring might be pretty
> painful.  I doubt there's anything insurmountable, but we'd definitely
> be changing some fundamental APIs.

Yeah.  I suppose mostly the same places that the current patch is
touching to pass around the PartitionPruneResult nodes.

> * In some cases (views, at least) we need to acquire lock on relations
> that aren't directly reflected anywhere in the plan tree.  So there'd
> have to be a separate mechanism for getting those locks and rechecking
> validity afterward.  A list of relevant relation OIDs might be enough
> for that.

Hmm, a list of only the OIDs wouldn't preserve the lock mode, so maybe
a list or bitmapset of the RTIs, something along the lines of
PlannedStmt.minLockRelids in the patch?

It perhaps even makes sense to make a special list in PlannedStmt for
only the views?

> * We currently do ExecCheckPermissions() before initializing the
> plan state tree.  It won't do to check permissions on relations we
> haven't yet locked, so that responsibility would have to be moved.
> Maybe that could also be integrated into the initialization recursion?
> Not sure.

Ah, I remember mentioning moving that into ExecGetRangeTableRelation()
[1], but I guess that misses relations that are not referenced in the
plan tree, such as views.  Though maybe that's not a problem if we
track views separately as mentioned above.

> * In the existing usage of AcquireExecutorLocks, if we do decide
> that the plan is stale then we are able to release all the locks
> we got before we go off and replan.  I'm not certain if that behavior
> needs to be preserved, but if it does then that would require some
> additional bookkeeping in the executor.

I think maybe we'll want to continue to release the existing locks,
because if we don't, it's possible we may keep some locks uselessly, since
replanning might lock a different set of relations.

> * This approach is optimizing on the assumption that we usually
> won't need to replan, because if we do then we might waste a fair
> amount of executor startup overhead before discovering we have
> to throw all that state away.  I think that's clearly the right
> way to bet, but perhaps somebody else has a different view.

Not sure if you'd like this, because it would still keep the
PartitionPruneResult business, but this will be less of a problem if
we do the initial pruning at the beginning of InitPlan(), followed by
locking, before doing anything else.  We would have initialized the
QueryDesc and the EState, but only minimally.  That also keeps the
PartitionPruneResult business local to the executor.

Would you like me to hack up a PoC or are you already on that?

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CA%2BHiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA%40mail.gmail.com



Re: generic plans and "initial" pruning

From
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
> On Fri, Jan 20, 2023 at 4:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I had what felt like an epiphany: the whole problem arises because the
>> system is wrongly factored.  We should get rid of AcquireExecutorLocks
>> altogether, allowing the plancache to hand back a generic plan that
>> it's not certain of the validity of, and instead integrate the
>> responsibility for acquiring locks into executor startup.

> Interesting.  The current implementation relies on
> PlanCacheRelCallback() marking a generic CachedPlan as invalid, so
> perhaps there will have to be some sharing of state between the
> plancache and the executor for this to work?

Yeah.  Thinking a little harder, I think this would have to involve
passing a CachedPlan pointer to the executor, and what the executor
would do after acquiring each lock is to ask the plancache "hey, do
you still think this CachedPlan entry is valid?".  In the case where
there's a problem, the AcceptInvalidationMessages call involved in
lock acquisition would lead to a cache inval that clears the validity
flag on the CachedPlan entry, and this would provide an inexpensive
way to check if that happened.

It might be possible to incorporate this pointer into PlannedStmt
instead of passing it separately.

>> * In a successfully built execution state tree, there will simply
>> not be any nodes corresponding to pruned-away, never-locked subplans.

> I think this is true with the patch as proposed too, but I was still a
> bit worried about what an ExecutorStart_hook may be doing with an
> uninitialized plan tree.  Maybe we're mandating that the hook must
> call standard_ExecutorStart() and only work with the finished
> PlanState tree?

It would certainly be incumbent on any such hook to not touch
not-yet-locked parts of the plan tree.  I'm not particularly concerned
about that sort of requirements change, because we'd be breaking APIs
all through this area in any case.

>> * In some cases (views, at least) we need to acquire lock on relations
>> that aren't directly reflected anywhere in the plan tree.  So there'd
>> have to be a separate mechanism for getting those locks and rechecking
>> validity afterward.  A list of relevant relation OIDs might be enough
>> for that.

> Hmm, a list of only the OIDs wouldn't preserve the lock mode,

Good point.  I wonder if we could integrate this with the
RTEPermissionInfo data structure?

> Would you like me to hack up a PoC or are you already on that?

I'm not planning to work on this myself, I was hoping you would.

            regards, tom lane



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Jan 20, 2023 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > On Fri, Jan 20, 2023 at 4:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I had what felt like an epiphany: the whole problem arises because the
> >> system is wrongly factored.  We should get rid of AcquireExecutorLocks
> >> altogether, allowing the plancache to hand back a generic plan that
> >> it's not certain of the validity of, and instead integrate the
> >> responsibility for acquiring locks into executor startup.
>
> > Interesting.  The current implementation relies on
> > PlanCacheRelCallback() marking a generic CachedPlan as invalid, so
> > perhaps there will have to be some sharing of state between the
> > plancache and the executor for this to work?
>
> Yeah.  Thinking a little harder, I think this would have to involve
> passing a CachedPlan pointer to the executor, and what the executor
> would do after acquiring each lock is to ask the plancache "hey, do
> you still think this CachedPlan entry is valid?".  In the case where
> there's a problem, the AcceptInvalidationMessages call involved in
> lock acquisition would lead to a cache inval that clears the validity
> flag on the CachedPlan entry, and this would provide an inexpensive
> way to check if that happened.

OK, thanks, this is useful.

> It might be possible to incorporate this pointer into PlannedStmt
> instead of passing it separately.

Yeah, that would be less churn.  Though, I wonder if you still hold
that PlannedStmt should not be scribbled upon outside the planner as
you said upthread [1]?

> >> * In a successfully built execution state tree, there will simply
> >> not be any nodes corresponding to pruned-away, never-locked subplans.
>
> > I think this is true with the patch as proposed too, but I was still a
> > bit worried about what an ExecutorStart_hook may be doing with an
> > uninitialized plan tree.  Maybe we're mandating that the hook must
> > call standard_ExecutorStart() and only work with the finished
> > PlanState tree?
>
> It would certainly be incumbent on any such hook to not touch
> not-yet-locked parts of the plan tree.  I'm not particularly concerned
> about that sort of requirements change, because we'd be breaking APIs
> all through this area in any case.

OK.  Perhaps something that should be documented around ExecutorStart().

> >> * In some cases (views, at least) we need to acquire lock on relations
> >> that aren't directly reflected anywhere in the plan tree.  So there'd
> >> have to be a separate mechanism for getting those locks and rechecking
> >> validity afterward.  A list of relevant relation OIDs might be enough
> >> for that.
>
> > Hmm, a list of only the OIDs wouldn't preserve the lock mode,
>
> Good point.  I wonder if we could integrate this with the
> RTEPermissionInfo data structure?

You mean adding a rellockmode field to RTEPermissionInfo?

> > Would you like me to hack up a PoC or are you already on that?
>
> I'm not planning to work on this myself, I was hoping you would.

Alright, I'll try to get something out early next week.  Thanks for
all the pointers.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/922566.1648784745%40sss.pgh.pa.us



Re: generic plans and "initial" pruning

From
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
> On Fri, Jan 20, 2023 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It might be possible to incorporate this pointer into PlannedStmt
>> instead of passing it separately.

> Yeah, that would be less churn.  Though, I wonder if you still hold
> that PlannedStmt should not be scribbled upon outside the planner as
> you said upthread [1]?

Well, the whole point of that rule is that the executor can't modify
a plancache entry.  If the plancache itself sets a field in such an
entry, that doesn't seem problematic from here.

But there's other possibilities if that bothers you; QueryDesc
could hold the field, for example.  Also, I bet we'd want to copy
it into EState for the main initialization recursion.

            regards, tom lane



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Jan 20, 2023 at 12:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > On Fri, Jan 20, 2023 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> It might be possible to incorporate this pointer into PlannedStmt
> >> instead of passing it separately.
>
> > Yeah, that would be less churn.  Though, I wonder if you still hold
> > that PlannedStmt should not be scribbled upon outside the planner as
> > you said upthread [1]?
>
> Well, the whole point of that rule is that the executor can't modify
> a plancache entry.  If the plancache itself sets a field in such an
> entry, that doesn't seem problematic from here.
>
> But there's other possibilities if that bothers you; QueryDesc
> could hold the field, for example.  Also, I bet we'd want to copy
> it into EState for the main initialization recursion.

QueryDesc sounds good to me, and yes, also a copy in EState in any case.

So I started looking at the call sites of CreateQueryDesc() and
stopped to look at ExecParallelGetQueryDesc().  AFAICS, we wouldn't
need to pass the CachedPlan to a parallel worker's rerun of
InitPlan(), because 1) it doesn't make sense to call the plancache in
a parallel worker, 2) the leader should already have taken all the
locks necessary for executing a given plan subnode that it intends
to pass to a worker in ExecInitGather().  Does that make sense?

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Jan 20, 2023 at 12:52 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Alright, I'll try to get something out early next week.  Thanks for
> all the pointers.

Sorry for the delay.  Attached is what I've come up with so far.

I didn't actually go with calling the plancache on every lock taken on
a relation, that is, in ExecGetRangeTableRelation().  One thing about
doing it that way that I didn't quite like (or didn't see a clean
enough way to code) is the need to complicate the ExecInitNode()
traversal for handling the abrupt suspension of the ongoing setup of
the PlanState tree.

So, I decided to keep the current model of locking all the relations
that need to be locked before doing anything else in InitPlan(), much
as AcquireExecutorLocks() does.  A new function called from
the top of InitPlan that I've called ExecLockRelationsIfNeeded() does
that locking after performing the initial pruning in the same manner
as the earlier patch did.  That does mean that I needed to keep all
the adjustments of the pruning code that are required for such
out-of-ExecInitNode() invocation of initial pruning, including the
PartitionPruneResult nodes that carry the result of that pruning for
ExecInitNode()-time reuse, though they no longer need be passed
through many unrelated interfaces.

Anyways, here's a description of the patches:

0001 adjusts various call sites of ExecutorStart() to cope with the
possibility of being asked to recreate a CachedPlan, if one is
involved.  The main objective here is to have as little as sensibly
possible happen between the GetCachedPlan() call that returned the
CachedPlan and ExecutorStart(), so as to minimize the chances of
failing to clean up resources that must not be leaked.

0002 is preparatory refactoring to make out-of-ExecInitNode()
invocation of pruning possible.

0003 moves the responsibility of CachedPlan validation locking into
ExecutorStart() as described above.




--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Jan 27, 2023 at 4:01 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Fri, Jan 20, 2023 at 12:52 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > Alright, I'll try to get something out early next week.  Thanks for
> > all the pointers.
>
> Sorry for the delay.  Attached is what I've come up with so far.
>
> I didn't actually go with calling the plancache on every lock taken on
> a relation, that is, in ExecGetRangeTableRelation().  One thing about
> doing it that way that I didn't quite like (or didn't see a clean
> enough way to code) is the need to complicate the ExecInitNode()
> traversal for handling the abrupt suspension of the ongoing setup of
> the PlanState tree.

OK, I gave this one more try and attached is what I came up with.

This adds an ExecPlanStillValid(), which is called right after anything
that may in turn call ExecGetRangeTableRelation(), which has been
taught to lock a relation if EXEC_FLAG_GET_LOCKS has been passed in
EState.es_top_eflags.  That includes all ExecInitNode() calls, and a
few other functions that call ExecGetRangeTableRelation() directly,
such as ExecOpenScanRelation().  If ExecPlanStillValid() returns
false, that is, if EState.es_cachedplan is found to have been
invalidated after a lock being taken by ExecGetRangeTableRelation(),
whatever function called it must return immediately and so must its
caller and so on.  ExecEndPlan() seems to be able to clean up after a
partially finished attempt of initializing a PlanState tree in this
way.  Maybe my preliminary testing didn't catch cases where pointers
to resources that are normally put into the nodes of a PlanState tree
are now left dangling, because a partially built PlanState tree is not
accessible to ExecEndPlan; QueryDesc.planstate would remain NULL in
such cases.  Maybe only es_tupleTable and es_relations need to be
explicitly released and the rest is taken care of by
resetting the ExecutorState context.
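
To give a rough idea, the check itself is cheap; simplified, what the
patch's ExecPlanStillValid() boils down to is:

    /* es_cachedplan is the new EState field mentioned above; NULL means
     * a freshly planned (uncached) query, so there's nothing to recheck. */
    static inline bool
    ExecPlanStillValid(EState *estate)
    {
        return estate->es_cachedplan == NULL ||
               estate->es_cachedplan->is_valid;
    }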

On testing, I'm afraid we're going to need something like
src/test/modules/delay_execution to test that concurrent changes to
relation(s) in PlannedStmt.relationOids that occur somewhere between
RevalidateCachedQuery() and InitPlan() result in the latter to be
aborted and that it is handled correctly.  It seems like it is only
the locking of partitions (that are not present in an unplanned Query
and thus not protected by AcquirePlannerLocks()) that can trigger
replanning of a CachedPlan, so any tests we write should involve
partitions.  Should this try to test as many plan shapes as possible
though given the uncertainty around ExecEndPlan() robustness or should
manual auditing suffice to be sure that nothing's broken?

On possibly needing to move permission checking to occur *after*
taking locks, I realized that we don't really need to, because no
relation that needs its permissions should be unlocked by the time we
get to ExecCheckPermissions(); note we only check permissions of
tables that are present in the original parse tree and
RevalidateCachedQuery() should have locked those.  I found a couple of
exceptions to that invariant in that views sometimes appear not to be
in the set of relations that RevalidateCachedQuery() locks.  So, I
invented PlannedStmt.viewRelations, a list of RT indexes of view RTEs
that is populated in setrefs.c. ExecLockViewRelations() called before
ExecCheckPermissions() locks those.
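
In case it helps review, ExecLockViewRelations() is roughly along these
lines (a simplified sketch, not the exact patch hunk):

    static void
    ExecLockViewRelations(EState *estate, PlannedStmt *stmt)
    {
        ListCell   *lc;

        foreach(lc, stmt->viewRelations)
        {
            RangeTblEntry *rte = exec_rt_fetch(lfirst_int(lc), estate);

            LockRelationOid(rte->relid, rte->rellockmode);
        }
    }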


--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Feb 2, 2023 at 11:49 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Fri, Jan 27, 2023 at 4:01 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > I didn't actually go with calling the plancache on every lock taken on
> > a relation, that is, in ExecGetRangeTableRelation().  One thing about
> > doing it that way that I didn't quite like (or didn't see a clean
> > enough way to code) is the need to complicate the ExecInitNode()
> > traversal for handling the abrupt suspension of the ongoing setup of
> > the PlanState tree.
>
> OK, I gave this one more try and attached is what I came up with.
>
> This adds a ExecPlanStillValid(), which is called right after anything
> that may in turn call ExecGetRangeTableRelation() which has been
> taught to lock a relation if EXEC_FLAG_GET_LOCKS has been passed in
> EState.es_top_eflags.  That includes all ExecInitNode() calls, and a
> few other functions that call ExecGetRangeTableRelation() directly,
> such as ExecOpenScanRelation().  If ExecPlanStillValid() returns
> false, that is, if EState.es_cachedplan is found to have been
> invalidated after a lock being taken by ExecGetRangeTableRelation(),
> whatever funcion called it must return immediately and so must its
> caller and so on.  ExecEndPlan() seems to be able to clean up after a
> partially finished attempt of initializing a PlanState tree in this
> way.  Maybe my preliminary testing didn't catch cases where pointers
> to resources that are normally put into the nodes of a PlanState tree
> are now left dangling, because a partially built PlanState tree is not
> accessible to ExecEndPlan; QueryDesc.planstate would remain NULL in
> such cases.  Maybe there's only es_tupleTable and es_relations that
> needs to be explicitly released and the rest is taken care of by
> resetting the ExecutorState context.

In the attached updated patch, I've made the functions that check
ExecPlanStillValid() to return NULL (if returning something) instead
of returning partially initialized structs.  Those partially
initialized structs were not being subsequently looked at anyway.

> On testing, I'm afraid we're going to need something like
> src/test/modules/delay_execution to test that concurrent changes to
> relation(s) in PlannedStmt.relationOids that occur somewhere between
> RevalidateCachedQuery() and InitPlan() result in the latter to be
> aborted and that it is handled correctly.  It seems like it is only
> the locking of partitions (that are not present in an unplanned Query
> and thus not protected by AcquirePlannerLocks()) that can trigger
> replanning of a CachedPlan, so any tests we write should involve
> partitions.  Should this try to test as many plan shapes as possible
> though given the uncertainty around ExecEndPlan() robustness or should
> manual auditing suffice to be sure that nothing's broken?

I've added a test case under src/modules/delay_execution by adding a
new ExecutorStart_hook that works similarly as
delay_execution_planner().  The test works by allowing a concurrent
session to drop an object being referenced in a cached plan being
initialized while the ExecutorStart_hook waits to get an advisory
lock.  The concurrent drop of the referenced object is detected during
ExecInitNode() and thus triggers replanning of the cached plan.

I also fixed a bug in ExplainExecuteQuery() and improved some comments while testing.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Andres Freund
Date:
Hi,

On 2023-02-03 22:01:09 +0900, Amit Langote wrote:
> I've added a test case under src/modules/delay_execution by adding a
> new ExecutorStart_hook that works similarly as
> delay_execution_planner().  The test works by allowing a concurrent
> session to drop an object being referenced in a cached plan being
> initialized while the ExecutorStart_hook waits to get an advisory
> lock.  The concurrent drop of the referenced object is detected during
> ExecInitNode() and thus triggers replanning of the cached plan.
> 
> I also fixed a bug in the ExplainExecuteQuery() while testing and some comments.

The tests seem to frequently hang on freebsd:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3478

Greetings,

Andres Freund



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Feb 7, 2023 at 23:38 Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2023-02-03 22:01:09 +0900, Amit Langote wrote:
> I've added a test case under src/modules/delay_execution by adding a
> new ExecutorStart_hook that works similarly as
> delay_execution_planner().  The test works by allowing a concurrent
> session to drop an object being referenced in a cached plan being
> initialized while the ExecutorStart_hook waits to get an advisory
> lock.  The concurrent drop of the referenced object is detected during
> ExecInitNode() and thus triggers replanning of the cached plan.
>
> I also fixed a bug in the ExplainExecuteQuery() while testing and some comments.

The tests seem to frequently hang on freebsd:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3478

Thanks for the heads up.  I've noticed this one too, though I couldn't find the testrun artifacts like I could get for some other failures (on other cirrus machines).  Has anyone else been in a similar situation?
--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Wed, Feb 8, 2023 at 7:31 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Feb 7, 2023 at 23:38 Andres Freund <andres@anarazel.de> wrote:
>> The tests seem to frequently hang on freebsd:
>> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3478
>
> Thanks for the heads up.  I've noticed this one too, though I couldn't find the testrun artifacts like I could get for
> some other failures (on other cirrus machines).  Has anyone else been in a similar situation?

I think I have figured out what might be going wrong on that cfbot
animal after building with the same CPPFLAGS as that animal locally.
I had forgotten to update _out/_readRangeTblEntry() to account for the
patch's change that a view's RTE_SUBQUERY now also preserves relkind
in addition to relid and rellockmode for the locking consideration.

Also, I noticed that a multi-query Portal execution with rules was
failing (thanks to a regression test added in a7d71c41db) because of
the snapshot used for the 2nd query onward not being updated for
command ID change under the patched model of multi-query Portal execution.
To wit, under the patched model, all queries in the multi-query Portal
case undergo ExecutorStart() before any of them is run with
ExecutorRun().  The patch hadn't changed things however to update the
snapshot's command ID for the 2nd query onwards, which caused the
aforementioned test case to fail.

This new model does however mean that the 2nd query onwards must use
PushCopiedSnapshot() given the current requirement of
UpdateActiveSnapshotCommandId() that the snapshot passed to it must
not be referenced anywhere else.  The new model basically requires
that each query's QueryDesc points to its own copy of the
ActiveSnapshot.  That may not be a thing in favor of the patched model
though.  For now, I haven't been able to come up with a better
alternative.
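
Concretely, the setup for the 2nd query onwards ends up looking roughly
like this (simplified):

    /* Give this query its own copy of the active snapshot, because
     * UpdateActiveSnapshotCommandId() insists that the snapshot it
     * updates not be referenced anywhere else. */
    PushCopiedSnapshot(GetActiveSnapshot());
    UpdateActiveSnapshotCommandId();
    /* ... CreateQueryDesc() / ExecutorStart() for this query ... */
    /* ... and eventually, once the query is done: */
    PopActiveSnapshot();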

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Mar 2, 2023 at 10:52 PM Amit Langote <amitlangote09@gmail.com> wrote:
> I think I have figured out what might be going wrong on that cfbot
> animal after building with the same CPPFLAGS as that animal locally.
> I had forgotten to update _out/_readRangeTblEntry() to account for the
> patch's change that a view's RTE_SUBQUERY now also preserves relkind
> in addition to relid and rellockmode for the locking consideration.
>
> Also, I noticed that a multi-query Portal execution with rules was
> failing (thanks to a regression test added in a7d71c41db) because of
> the snapshot used for the 2nd query onward not being updated for
> command ID change under patched model of multi-query Portal execution.
> To wit, under the patched model, all queries in the multi-query Portal
> case undergo ExecutorStart() before any of it is run with
> ExecutorRun().  The patch hadn't changed things however to update the
> snapshot's command ID for the 2nd query onwards, which caused the
> aforementioned test case to fail.
>
> This new model does however mean that the 2nd query onwards must use
> PushCopiedSnapshot() given the current requirement of
> UpdateActiveSnapshotCommandId() that the snapshot passed to it must
> not be referenced anywhere else.  The new model basically requires
> that each query's QueryDesc points to its own copy of the
> ActiveSnapshot.  That may not be a thing in favor of the patched model
> though.  For now, I haven't been able to come up with a better
> alternative.

Here's a new version addressing the following 2 points.

* Like views, I realized that non-leaf relations of partition trees
scanned by an Append/MergeAppend would need to be locked separately,
because ExecInitNode() traversal of the plan tree would not account
for them.  That is, they are not opened using
ExecGetRangeTableRelation() or ExecOpenScanRelation().  One exception
is that some (if not all) of those non-leaf relations may be
referenced in PartitionPruneInfo and so locked as part of initializing
the corresponding PartitionPruneState, but I decided not to complicate
the code to filter out such relations from the set locked separately.
To carry the set of relations to lock, the refactoring patch 0001
re-introduces the List of Bitmapset field named allpartrelids into
Append/MergeAppend nodes, which we had previously removed on the
grounds that those relations need not be locked separately (commits
f2343653f5b, f003a7522bf).

* I decided to initialize QueryDesc.planstate even in the cases where
ExecInitNode() traversal is aborted in the middle on detecting
CachedPlan invalidation such that it points to a partially initialized
PlanState tree.  My earlier thinking that each PlanState node need not
be visited for resource cleanup in such cases was naive after all.  To
that end, I've fixed the ExecEndNode() subroutines of all Plan node
types to account for potentially uninitialized fields.  There are a
couple of cases where I'm a bit doubtful though.  In
ExecEndCustomScan(), there's no indication in CustomScanState whether
it's OK to call EndCustomScan() when BeginCustomScan() may not have
been called.  For ForeignScanState, I've assumed that
ForeignScanState.fdw_state being set can be used as a marker that
BeginForeignScan would have been called, though maybe that's not a
solid assumption.
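
For instance, the ForeignScan case comes down to a check like this in
ExecEndForeignScan() (a sketch, relying on the assumption stated above
that fdw_state being set means BeginForeignScan was called):

    /* Let the FDW shut down only if its Begin callback appears to
     * have been called. */
    if (node->fdw_state != NULL)
        node->fdwroutine->EndForeignScan(node);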

I'm also attaching a new (small) patch 0003 that eliminates the
loop-over-rangetable in ExecCloseRangeTableRelations() in favor of
iterating over a new List field of EState named es_opened_relations,
which is populated by ExecGetRangeTableRelation() with only the
relations that were opened.  This speeds up
ExecCloseRangeTableRelations() significantly for the cases with many
runtime-prunable partitions.
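
For reference, the new cleanup loop in ExecCloseRangeTableRelations() is
essentially just (simplified):

    ListCell   *lc;

    /* Close only the relations that ExecGetRangeTableRelation() actually
     * opened, rather than walking the whole range table. */
    foreach(lc, estate->es_opened_relations)
    {
        Relation    rel = (Relation) lfirst(lc);

        table_close(rel, NoLock);
    }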

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Mar 14, 2023 at 7:07 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Thu, Mar 2, 2023 at 10:52 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > I think I have figured out what might be going wrong on that cfbot
> > animal after building with the same CPPFLAGS as that animal locally.
> > I had forgotten to update _out/_readRangeTblEntry() to account for the
> > patch's change that a view's RTE_SUBQUERY now also preserves relkind
> > in addition to relid and rellockmode for the locking consideration.
> >
> > Also, I noticed that a multi-query Portal execution with rules was
> > failing (thanks to a regression test added in a7d71c41db) because of
> > the snapshot used for the 2nd query onward not being updated for
> > command ID change under patched model of multi-query Portal execution.
> > To wit, under the patched model, all queries in the multi-query Portal
> > case undergo ExecutorStart() before any of it is run with
> > ExecutorRun().  The patch hadn't changed things however to update the
> > snapshot's command ID for the 2nd query onwards, which caused the
> > aforementioned test case to fail.
> >
> > This new model does however mean that the 2nd query onwards must use
> > PushCopiedSnapshot() given the current requirement of
> > UpdateActiveSnapshotCommandId() that the snapshot passed to it must
> > not be referenced anywhere else.  The new model basically requires
> > that each query's QueryDesc points to its own copy of the
> > ActiveSnapshot.  That may not be a thing in favor of the patched model
> > though.  For now, I haven't been able to come up with a better
> > alternative.
>
> Here's a new version addressing the following 2 points.
>
> * Like views, I realized that non-leaf relations of partition trees
> scanned by an Append/MergeAppend would need to be locked separately,
> because ExecInitNode() traversal of the plan tree would not account
> for them.  That is, they are not opened using
> ExecGetRangeTableRelation() or ExecOpenScanRelation().  One exception
> is that some (if not all) of those non-leaf relations may be
> referenced in PartitionPruneInfo and so locked as part of initializing
> the corresponding PartitionPruneState, but I decided not to complicate
> the code to filter out such relations from the set locked separately.
> To carry the set of relations to lock, the refactoring patch 0001
> re-introduces the List of Bitmapset field named allpartrelids into
> Append/MergeAppend nodes, which we had previously removed on the
> grounds that those relations need not be locked separately (commits
> f2343653f5b, f003a7522bf).
>
> * I decided to initialize QueryDesc.planstate even in the cases where
> ExecInitNode() traversal is aborted in the middle on detecting
> CachedPlan invalidation such that it points to a partially initialized
> PlanState tree.  My earlier thinking that each PlanState node need not
> be visited for resource cleanup in such cases was naive after all.  To
> that end, I've fixed the ExecEndNode() subroutines of all Plan node
> types to account for potentially uninitialized fields.  There are a
> couple of cases where I'm a bit doubtful though.  In
> ExecEndCustomScan(), there's no indication in CustomScanState whether
> it's OK to call EndCustomScan() when BeginCustomScan() may not have
> been called.  For ForeignScanState, I've assumed that
> ForeignScanState.fdw_state being set can be used as a marker that
> BeginForeignScan would have been called, though maybe that's not a
> solid assumption.
>
> I'm also attaching a new (small) patch 0003 that eliminates the
> loop-over-rangetable in ExecCloseRangeTableRelations() in favor of
> iterating over a new List field of EState named es_opened_relations,
> which is populated by ExecGetRangeTableRelation() with only the
> relations that were opened.  This speeds up
> ExecCloseRangeTableRelations() significantly for the cases with many
> runtime-prunable partitions.

Here's another version with some cosmetic changes, like fixing some
factually incorrect / obsolete comments and typos that I found.  I
also noticed that I had missed noting near some table_open() calls that
locks taken there can't possibly invalidate a plan (such as those on
lazily opened partition routing target partitions) and thus don't need
the treatment that locking during execution initialization requires.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Wed, Mar 22, 2023 at 9:48 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Mar 14, 2023 at 7:07 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Thu, Mar 2, 2023 at 10:52 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > I think I have figured out what might be going wrong on that cfbot
> > > animal after building with the same CPPFLAGS as that animal locally.
> > > I had forgotten to update _out/_readRangeTblEntry() to account for the
> > > patch's change that a view's RTE_SUBQUERY now also preserves relkind
> > > in addition to relid and rellockmode for the locking consideration.
> > >
> > > Also, I noticed that a multi-query Portal execution with rules was
> > > failing (thanks to a regression test added in a7d71c41db) because of
> > > the snapshot used for the 2nd query onward not being updated for
> > > command ID change under patched model of multi-query Portal execution.
> > > To wit, under the patched model, all queries in the multi-query Portal
> > > case undergo ExecutorStart() before any of it is run with
> > > ExecutorRun().  The patch hadn't changed things however to update the
> > > snapshot's command ID for the 2nd query onwards, which caused the
> > > aforementioned test case to fail.
> > >
> > > This new model does however mean that the 2nd query onwards must use
> > > PushCopiedSnapshot() given the current requirement of
> > > UpdateActiveSnapshotCommandId() that the snapshot passed to it must
> > > not be referenced anywhere else.  The new model basically requires
> > > that each query's QueryDesc points to its own copy of the
> > > ActiveSnapshot.  That may not be a thing in favor of the patched model
> > > though.  For now, I haven't been able to come up with a better
> > > alternative.
> >
> > Here's a new version addressing the following 2 points.
> >
> > * Like views, I realized that non-leaf relations of partition trees
> > scanned by an Append/MergeAppend would need to be locked separately,
> > because ExecInitNode() traversal of the plan tree would not account
> > for them.  That is, they are not opened using
> > ExecGetRangeTableRelation() or ExecOpenScanRelation().  One exception
> > is that some (if not all) of those non-leaf relations may be
> > referenced in PartitionPruneInfo and so locked as part of initializing
> > the corresponding PartitionPruneState, but I decided not to complicate
> > the code to filter out such relations from the set locked separately.
> > To carry the set of relations to lock, the refactoring patch 0001
> > re-introduces the List of Bitmapset field named allpartrelids into
> > Append/MergeAppend nodes, which we had previously removed on the
> > grounds that those relations need not be locked separately (commits
> > f2343653f5b, f003a7522bf).
> >
> > * I decided to initialize QueryDesc.planstate even in the cases where
> > ExecInitNode() traversal is aborted in the middle on detecting
> > CachedPlan invalidation such that it points to a partially initialized
> > PlanState tree.  My earlier thinking that each PlanState node need not
> > be visited for resource cleanup in such cases was naive after all.  To
> > that end, I've fixed the ExecEndNode() subroutines of all Plan node
> > types to account for potentially uninitialized fields.  There are a
> > couple of cases where I'm a bit doubtful though.  In
> > ExecEndCustomScan(), there's no indication in CustomScanState whether
> > it's OK to call EndCustomScan() when BeginCustomScan() may not have
> > been called.  For ForeignScanState, I've assumed that
> > ForeignScanState.fdw_state being set can be used as a marker that
> > BeginForeignScan would have been called, though maybe that's not a
> > solid assumption.
> >
> > I'm also attaching a new (small) patch 0003 that eliminates the
> > loop-over-rangetable in ExecCloseRangeTableRelations() in favor of
> > iterating over a new List field of EState named es_opened_relations,
> > which is populated by ExecGetRangeTableRelation() with only the
> > relations that were opened.  This speeds up
> > ExecCloseRangeTableRelations() significantly for the cases with many
> > runtime-prunable partitions.
>
> Here's another version with some cosmetic changes, like fixing some
> factually incorrect / obsolete comments and typos that I found.  I
> also noticed that I had missed noting near some table_open() that
> locks taken with those can't possibly invalidate a plan (such as
> lazily opened partition routing target partitions) and thus need the
> treatment that locking during execution initialization requires.

Rebased over 3c05284d83b2 ("Invent GENERIC_PLAN option for EXPLAIN.").

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
> > On Tue, Mar 14, 2023 at 7:07 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > * I decided to initialize QueryDesc.planstate even in the cases where
> > > ExecInitNode() traversal is aborted in the middle on detecting
> > > CachedPlan invalidation such that it points to a partially initialized
> > > PlanState tree.  My earlier thinking that each PlanState node need not
> > > be visited for resource cleanup in such cases was naive after all.  To
> > > that end, I've fixed the ExecEndNode() subroutines of all Plan node
> > > types to account for potentially uninitialized fields.  There are a
> > > couple of cases where I'm a bit doubtful though.  In
> > > ExecEndCustomScan(), there's no indication in CustomScanState whether
> > > it's OK to call EndCustomScan() when BeginCustomScan() may not have
> > > been called.  For ForeignScanState, I've assumed that
> > > ForeignScanState.fdw_state being set can be used as a marker that
> > > BeginForeignScan would have been called, though maybe that's not a
> > > solid assumption.

It seems I hadn't noted in ExecEndNode()'s comment that all node
types' recursive subroutines need to handle the change made by this
patch that the corresponding ExecInitNode() subroutine may now return
early without having initialized all state struct fields.

Also noted in the documentation for CustomScan and ForeignScan that
the Begin*Scan callback may not have been called at all, so the
End*Scan should handle that gracefully.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
> [ v38 patchset ]

I spent a little bit of time looking through this, and concluded that
it's not something I will be wanting to push into v16 at this stage.
The patch doesn't seem very close to being committable on its own
terms, and even if it was now is not a great time in the dev cycle
to be making significant executor API changes.  Too much risk of
having to thrash the API during beta, or even change it some more
in v17.  I suggest that we push this forward to the next CF with the
hope of landing it early in v17.

A few concrete thoughts:

* I understand that your plan now is to acquire locks on all the
originally-named tables, then do permissions checks (which will
involve only those tables), then dynamically lock just inheritance and
partitioning child tables as we descend the plan tree.  That seems
more or less okay to me, but it could be reflected better in the
structure of the patch perhaps.

* In particular I don't much like the "viewRelations" list, which
seems like a wart; those ought to be handled more nearly the same way
as other RTEs.  (One concrete reason why is that this scheme is going
to result in locking views in a different order than they were locked
during original parsing, which perhaps could contribute to deadlocks.)
Maybe we should store an integer list of which RTIs need to be locked
in the early phase?  Building that in the parser/rewriter would provide
a solid guide to the original locking order, so we'd be trivially sure
of duplicating that.  (It might be close enough to follow the RT list
order, which is basically what AcquireExecutorLocks does today, but
this'd be more certain to do the right thing.)  I'm less concerned
about lock order for child tables because those are just going to
follow the inheritance or partitioning structure.

* I don't understand the need for changes like this:

     /* clean up tuple table */
-    ExecClearTuple(node->ps.ps_ResultTupleSlot);
+    if (node->ps.ps_ResultTupleSlot)
+        ExecClearTuple(node->ps.ps_ResultTupleSlot);

ISTM that the process ought to involve taking a lock (if needed)
before we have built any execution state for a given plan node,
and if we find we have to fail, returning NULL instead of a
partially-valid planstate node.  Otherwise, considerations of how
to handle partially-valid nodes are going to metastasize into all
sorts of places, almost certainly including EXPLAIN for instance.
I think we ought to be able to limit the damage to "parent nodes
might have NULL child links that you wouldn't have expected".
That wouldn't faze ExecEndNode at all, nor most other code.

* More attention is needed to comments.  For example, in a couple of
places in plancache.c you have removed function header comments
defining API details and not replaced them with any info about the new
details, despite the fact that those details are more complex than the
old.

> It seems I hadn't noted in the ExecEndNode()'s comment that all node
> types' recursive subroutines need to  handle the change made by this
> patch that the corresponding ExecInitNode() subroutine may now return
> early without having initialized all state struct fields.
> Also noted in the documentation for CustomScan and ForeignScan that
> the Begin*Scan callback may not have been called at all, so the
> End*Scan should handle that gracefully.

Yeah, I think we need to avoid adding such requirements.  It's the
sort of thing that would far too easily get past developer testing
and only fail once in a blue moon in the field.

            regards, tom lane



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Apr 4, 2023 at 6:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > [ v38 patchset ]
>
> I spent a little bit of time looking through this, and concluded that
> it's not something I will be wanting to push into v16 at this stage.
> The patch doesn't seem very close to being committable on its own
> terms, and even if it was now is not a great time in the dev cycle
> to be making significant executor API changes.  Too much risk of
> having to thrash the API during beta, or even change it some more
> in v17.  I suggest that we push this forward to the next CF with the
> hope of landing it early in v17.

OK, thanks a lot for your feedback.

> A few concrete thoughts:
>
> * I understand that your plan now is to acquire locks on all the
> originally-named tables, then do permissions checks (which will
> involve only those tables), then dynamically lock just inheritance and
> partitioning child tables as we descend the plan tree.

Actually, with the current implementation of the patch, *all* of the
relations mentioned in the plan tree would get locked during the
ExecInitNode() traversal of the plan tree (and of those in
plannedstmt->subplans), not just the inheritance child tables.
Locking of non-child tables done by the executor after this patch is
duplicative with AcquirePlannerLocks(), so that's something to be
improved.

> That seems
> more or less okay to me, but it could be reflected better in the
> structure of the patch perhaps.
>
> * In particular I don't much like the "viewRelations" list, which
> seems like a wart; those ought to be handled more nearly the same way
> as other RTEs.  (One concrete reason why is that this scheme is going
> to result in locking views in a different order than they were locked
> during original parsing, which perhaps could contribute to deadlocks.)
> Maybe we should store an integer list of which RTIs need to be locked
> in the early phase?  Building that in the parser/rewriter would provide
> a solid guide to the original locking order, so we'd be trivially sure
> of duplicating that.  (It might be close enough to follow the RT list
> order, which is basically what AcquireExecutorLocks does today, but
> this'd be more certain to do the right thing.)  I'm less concerned
> about lock order for child tables because those are just going to
> follow the inheritance or partitioning structure.

What you've described here sounds somewhat like what I had implemented
in the patch versions till v31, though it used a bitmapset named
minLockRelids that is initialized by setrefs.c.  Your idea of
initializing a list before planning seems more appealing offhand than
the code I had added in setrefs.c to populate that minLockRelids
bitmapset, which would be bms_add_range(1, list_length(finalrtable)),
followed by bms_del_members(set-of-child-rel-rtis).
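
For context, the setrefs.c code being referred to was roughly of this
shape ("glob" is the PlannerGlobal; "childrelids" is just a stand-in
name here for the set of child-table RT indexes collected while walking
the plan tree):

    Bitmapset  *minLockRelids;

    /* Start with every RT index, then drop the child tables, leaving
     * the minimal set that must always be locked up front. */
    minLockRelids = bms_add_range(NULL, 1, list_length(glob->finalrtable));
    minLockRelids = bms_del_members(minLockRelids, childrelids);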

I'll give your idea a try.

> * I don't understand the need for changes like this:
>
>         /* clean up tuple table */
> -       ExecClearTuple(node->ps.ps_ResultTupleSlot);
> +       if (node->ps.ps_ResultTupleSlot)
> +               ExecClearTuple(node->ps.ps_ResultTupleSlot);
>
> ISTM that the process ought to involve taking a lock (if needed)
> before we have built any execution state for a given plan node,
> and if we find we have to fail, returning NULL instead of a
> partially-valid planstate node.  Otherwise, considerations of how
> to handle partially-valid nodes are going to metastasize into all
> sorts of places, almost certainly including EXPLAIN for instance.
> I think we ought to be able to limit the damage to "parent nodes
> might have NULL child links that you wouldn't have expected".
> That wouldn't faze ExecEndNode at all, nor most other code.

Hmm, yes, taking a lock before allocating any of the stuff to add into
the planstate seems like it's much easier to reason about than the
alternative I've implemented.

> * More attention is needed to comments.  For example, in a couple of
> places in plancache.c you have removed function header comments
> defining API details and not replaced them with any info about the new
> details, despite the fact that those details are more complex than the
> old.

OK, yeah, maybe I've added a bunch of explanations in execMain.c that
should perhaps have been in plancache.c.

> > It seems I hadn't noted in the ExecEndNode()'s comment that all node
> > types' recursive subroutines need to  handle the change made by this
> > patch that the corresponding ExecInitNode() subroutine may now return
> > early without having initialized all state struct fields.
> > Also noted in the documentation for CustomScan and ForeignScan that
> > the Begin*Scan callback may not have been called at all, so the
> > End*Scan should handle that gracefully.
>
> Yeah, I think we need to avoid adding such requirements.  It's the
> sort of thing that would far too easily get past developer testing
> and only fail once in a blue moon in the field.

OK, got it.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Apr 4, 2023 at 10:29 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Apr 4, 2023 at 6:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > A few concrete thoughts:
> >
> > * I understand that your plan now is to acquire locks on all the
> > originally-named tables, then do permissions checks (which will
> > involve only those tables), then dynamically lock just inheritance and
> > partitioning child tables as we descend the plan tree.
>
> Actually, with the current implementation of the patch, *all* of the
> relations mentioned in the plan tree would get locked during the
> ExecInitNode() traversal of the plan tree (and of those in
> plannedstmt->subplans), not just the inheritance child tables.
> Locking of non-child tables done by the executor after this patch is
> duplicative with AcquirePlannerLocks(), so that's something to be
> improved.
>
> > That seems
> > more or less okay to me, but it could be reflected better in the
> > structure of the patch perhaps.
> >
> > * In particular I don't much like the "viewRelations" list, which
> > seems like a wart; those ought to be handled more nearly the same way
> > as other RTEs.  (One concrete reason why is that this scheme is going
> > to result in locking views in a different order than they were locked
> > during original parsing, which perhaps could contribute to deadlocks.)
> > Maybe we should store an integer list of which RTIs need to be locked
> > in the early phase?  Building that in the parser/rewriter would provide
> > a solid guide to the original locking order, so we'd be trivially sure
> > of duplicating that.  (It might be close enough to follow the RT list
> > order, which is basically what AcquireExecutorLocks does today, but
> > this'd be more certain to do the right thing.)  I'm less concerned
> > about lock order for child tables because those are just going to
> > follow the inheritance or partitioning structure.
>
> What you've described here sounds somewhat like what I had implemented
> in the patch versions till v31, though it used a bitmapset named
> minLockRelids that is initialized by setrefs.c.  Your idea of
> initializing a list before planning seems more appealing offhand than
> the code I had added in setrefs.c to populate that minLockRelids
> bitmapset, which would be bms_add_range(1, list_lenth(finalrtable)),
> followed by bms_del_members(set-of-child-rel-rtis).
>
> I'll give your idea a try.

After sleeping on this, I think we perhaps don't need to remember the originally-named relations, at least not for the purpose of locking them for execution.  That's because, for a reused (cached) plan, AcquirePlannerLocks() would have taken those locks anyway.

AcquirePlannerLocks() doesn't lock inheritance children because they would be added to the range table by the planner, so they should be locked separately for execution, if needed.  I thought taking the execution-time locks only when inside ExecInit[Merge]Append would work, but then we have cases where single-child Append/MergeAppend nodes are stripped by setrefs.c.  Maybe we need a place to remember such child relations, that is, only in the cases where Append/MergeAppend elision occurs, in something perhaps esoteric-sounding like PlannedStmt.elidedAppendChildRels?

Another set of child relations that are not covered by Append/MergeAppend child nodes is non-leaf partitions.  I've proposed adding a List of Bitmapset field to Append/MergeAppend named 'allpartrelids' as part of this patchset (patch 0001) to track those for execution-time locking.
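
To show what the execution-time locking based on that field would look
like, roughly ("aplan" stands for the Append plan node; simplified
sketch, not the actual patch hunk):

    ListCell   *lc;

    /* allpartrelids contains one bitmapset of RT indexes per
     * partitioned-table hierarchy scanned by this Append/MergeAppend. */
    foreach(lc, aplan->allpartrelids)
    {
        Bitmapset  *partrelids = (Bitmapset *) lfirst(lc);
        int         rti = -1;

        while ((rti = bms_next_member(partrelids, rti)) >= 0)
        {
            RangeTblEntry *rte = exec_rt_fetch(rti, estate);

            LockRelationOid(rte->relid, rte->rellockmode);
        }
    }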


--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
Here is a new version.  Summary of main changes since the last version
that Tom reviewed back in April:

* ExecInitNode() subroutines now return NULL (as opposed to a
partially initialized PlanState node as in the last version) upon
detecting that the CachedPlan that the plan tree is from is no longer
valid due to invalidation messages processed upon taking locks.  Plan
tree subnodes that are fully initialized till the point of detection
are added by ExecInitNode() into a List in EState called
es_inited_plannodes.  ExecEndPlan() now iterates over that list to
close each one individually using ExecEndNode().  ExecEndNode() or its
subroutines thus no longer need to be recursive to close the child
nodes.  Also, with this design, there is no longer the possibility of
partially initialized PlanState trees with partially initialized
individual PlanState nodes, so the ExecEndNode() subroutine changes
that were in the last version to account for partial initialization
are not necessary.

* Instead of setting EXEC_FLAG_GET_LOCKS in es_top_eflags for the
entire duration of InitPlan(), it is now only set in ExecInitAppend()
and ExecInitMergeAppend(), because that's where the subnodes scanning
child tables would be and the executor only needs to lock child tables
to validate a CachedPlan in a race-free manner.  Parent tables that
appear in the query would have been locked by AcquirePlannerLocks().
Child tables whose scan subnodes don't appear under Append/MergeAppend
(due to the latter being removed by setrefs.c for there being only a
single child) are identified in PlannedStmt.elidedAppendChildRelations
and InitPlan() locks each one found there if the plan tree is from a
CachedPlan.

* There's no longer PlannedStmt.viewRelations, because view relations
need not be tracked separately for locking as AcquirePlannerLocks()
covers them.
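
To illustrate the first point above, the bookkeeping added at the bottom
of ExecInitNode() is essentially (simplified sketch, not the actual
hunk):

    /* "result" is the PlanState that the node-type-specific subroutine
     * built, or NULL if it bailed out because the CachedPlan went stale. */
    if (result == NULL)
        return NULL;

    /* Remember every fully initialized node so that ExecEndPlan() can
     * later close each one individually, without recursing. */
    estate->es_inited_plannodes = lappend(estate->es_inited_plannodes,
                                          result);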

Attachment

Re: generic plans and "initial" pruning

From
Daniel Gustafsson
Date:
> On 8 Jun 2023, at 16:23, Amit Langote <amitlangote09@gmail.com> wrote:
> 
> Here is a new version.

The local planstate variable in the hunk below is shadowing the function
parameter planstate, which causes a compiler warning:

@@ -1495,18 +1556,15 @@ ExecEndPlan(PlanState *planstate, EState *estate)
     ListCell   *l;
 
     /*
-     * shut down the node-type-specific query processing
-     */
-    ExecEndNode(planstate);
-
-    /*
-     * for subplans too
+     * Shut down the node-type-specific query processing for all nodes that
+     * were initialized during InitPlan(), both in the main plan tree and those
+     * in subplans (es_subplanstates), if any.
      */
-    foreach(l, estate->es_subplanstates)
+    foreach(l, estate->es_inited_plannodes)
     {
-        PlanState  *subplanstate = (PlanState *) lfirst(l);
+        PlanState  *planstate = (PlanState *) lfirst(l);

--
Daniel Gustafsson




Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Mon, Jul 3, 2023 at 10:27 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> > On 8 Jun 2023, at 16:23, Amit Langote <amitlangote09@gmail.com> wrote:
> >
> > Here is a new version.
>
> The local planstate variable in the hunk below is shadowing the function
> parameter planstate which cause a compiler warning:

Thanks Daniel for the heads up.

Attached new version fixes that and contains a few other notable
changes.  Before going into the details of those changes, let me
reiterate in broad strokes what the patch is trying to do.

The idea is to move the locking of some tables referenced in a cached
(generic) plan from plancache/GetCachedPlan() to the
executor/ExecutorStart().  Specifically, the locking of inheritance
child tables.  Why?  Because partition pruning with "initial pruning
steps" contained in the Append/MergeAppend nodes may eliminate some
child tables that need not have been locked to begin with, though the
pruning can only occur during ExecutorStart().

After applying this patch, GetCachedPlan() only locks the tables that
are directly mentioned in the query to ensure that the
analyzed-rewritten-but-unplanned query tree backing a given CachedPlan
is still valid (cf RevalidateCachedQuery()), but not the tables in the
CachedPlan that would have been added by the planner.  Tables in a
CachedPlan that would not be locked currently only include the
inheritance child tables / partitions of the tables mentioned in the
query.  This means that the plan trees in a given CachedPlan returned
by GetCachedPlan() are only partially valid and are subject to
invalidation because concurrent sessions can possibly modify the child
tables referenced in them before ExecutorStart() gets around to
locking them.  If the concurrent modifications do happen,
ExecutorStart() is now equipped to detect them by way of noticing that
the CachedPlan is invalidated and inform the caller to discard and
recreate the CachedPlan.  This entails changing all the call sites of
ExecutorStart() that pass it a plan tree from a CachedPlan to
implement the replan-and-retry-execution loop.

Given the above, ExecutorStart(), which has not needed so far to take
any locks (except on indexes mentioned in IndexScans), now needs to
lock child tables if executing a cached plan which contains them.  In
the previous versions, the patch used a flag passed in
EState.es_top_eflags to signal ExecGetRangeTableRelation() to lock the
table.  The flag would be set in ExecInitAppend() and
ExecInitMergeAppend() for the duration of the loop that initializes
child subplans with the assumption that that's where the child tables
would be opened.  But not all child subplans of Append/MergeAppend
scan child tables (think UNION ALL queries), so this approach can
result in redundant locking.  Worse, I needed to invent
PlannedStmt.elidedAppendChildRelations to separately track child
tables whose Scan nodes' parent Append/MergeAppend would be removed by
setrefs.c in some cases.

So, this new patch uses a flag in the RangeTblEntry itself to denote
if the table is a child table instead of the above roundabout way.
ExecGetRangeTableRelation() can simply look at the RTE to decide
whether to take a lock or not.  I considered adding a new bool field,
but noticed we already have inFromCl to track if a given RTE is for
table/entity directly mentioned in the query or for something added
behind-the-scenes into the range table as the field's description in
parsenodes.h says.  RTEs for child tables are added behind-the-scenes
by the planner and it makes perfect sense to me to mark their inFromCl
as false.  I can't find anything that relies on the current behavior
of inFromCl being set to the same value as the root inheritance parent
(true).  Patch 0002 makes this change for child RTEs.
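
To illustrate the decision, ExecGetRangeTableRelation() ends up doing something like this (a sketch only; 'es_cachedplan' is an illustrative name for however the EState knows it is executing a cached plan):

RangeTblEntry *rte = exec_rt_fetch(rti, estate);
Relation    rel;

if (!rte->inFromCl && estate->es_cachedplan != NULL)
{
    /* planner-added child table: the executor must take the lock itself */
    rel = table_open(rte->relid, rte->rellockmode);
}
else
{
    /* tables mentioned in the query were locked by AcquirePlannerLocks() */
    rel = table_open(rte->relid, NoLock);
}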

A few other notes:

* A parallel worker does ExecutorStart() without access to the
CachedPlan that the leader may have gotten its plan tree from.  This
means that parallel workers do not have the ability to detect plan
tree invalidations.  I think that's fine, because if the leader would
have been able to launch workers at all, it would also have gotten all
the locks to protect the (portion of) the plan tree that the workers
would be executing.  I had an off-list discussion about this with
Robert and he mentioned his concern that each parallel worker would
have its own view of which child subplans of a parallel Append are
"valid" that depends on the result of its own evaluation of initial
pruning.   So, there may be race conditions whereby a worker may try
to execute plan nodes that are no longer valid, for example, if the
partition a worker considers valid is not viewed as such by the leader
and thus not locked.  I shared my thoughts as to why that sounds
unlikely at [1], though maybe I'm a bit too optimistic?

* For multi-query portals, you can't now do ExecutorStart()
immediately followed by ExecutorRun() for each query in the portal,
because ExecutorStart() may now fail to start a plan if it gets
invalidated.  So PortalStart() now does the ExecutorStart()s for all
queries and remembers the QueryDescs, which PortalRun() then uses to do
the ExecutorRun()s.  A consequence of this is that
CommandCounterIncrement() now must be done between the
ExecutorStart()s of the individual plans in PortalStart() and not
between the ExecutorRun()s in PortalRunMulti().  make check-world
passes with this new arrangement, though I'm not entirely confident
that there are no problems lurking.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

[1] https://postgr.es/m/CA+HiwqFA=swkzgGK8AmXUNFtLeEXFJwFyY3E7cTxvL46aa1OTw@mail.gmail.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Jul 6, 2023 at 11:29 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Mon, Jul 3, 2023 at 10:27 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> > > On 8 Jun 2023, at 16:23, Amit Langote <amitlangote09@gmail.com> wrote:
> > > Here is a new version.
> >
> > The local planstate variable in the hunk below is shadowing the function
> > parameter planstate which cause a compiler warning:
>
> Thanks Daniel for the heads up.
>
> Attached new version fixes that and contains a few other notable
> changes.  Before going into the details of those changes, let me
> reiterate in broad strokes what the patch is trying to do.
>
> The idea is to move the locking of some tables referenced in a cached
> (generic) plan from plancache/GetCachedPlan() to the
> executor/ExecutorStart().  Specifically, the locking of inheritance
> child tables.  Why?  Because partition pruning with "initial pruning
> steps" contained in the Append/MergeAppend nodes may eliminate some
> child tables that need not have been locked to begin with, though the
> pruning can only occur during ExecutorStart().
>
> After applying this patch, GetCachedPlan() only locks the tables that
> are directly mentioned in the query to ensure that the
> analyzed-rewritten-but-unplanned query tree backing a given CachedPlan
> is still valid (cf RevalidateCachedQuery()), but not the tables in the
> CachedPlan that would have been added by the planner.  Tables in a
> CachePlan that would not be locked currently only include the
> inheritance child tables / partitions of the tables mentioned in the
> query.  This means that the plan trees in a given CachedPlan returned
> by GetCachedPlan() are only partially valid and are subject to
> invalidation because concurrent sessions can possibly modify the child
> tables referenced in them before ExecutorStart() gets around to
> locking them.  If the concurrent modifications do happen,
> ExecutorStart() is now equipped to detect them by way of noticing that
> the CachedPlan is invalidated and inform the caller to discard and
> recreate the CachedPlan.  This entails changing all the call sites of
> ExecutorStart() that pass it a plan tree from a CachedPlan to
> implement the replan-and-retry-execution loop.
>
> Given the above, ExecutorStart(), which has not needed so far to take
> any locks (except on indexes mentioned in IndexScans), now needs to
> lock child tables if executing a cached plan which contains them.  In
> the previous versions, the patch used a flag passed in
> EState.es_top_eflags to signal ExecGetRangeTableRelation() to lock the
> table.  The flag would be set in ExecInitAppend() and
> ExecInitMergeAppend() for the duration of the loop that initializes
> child subplans with the assumption that that's where the child tables
> would be opened.  But not all child subplans of Append/MergeAppend
> scan child tables (think UNION ALL queries), so this approach can
> result in redundant locking.  Worse, I needed to invent
> PlannedStmt.elidedAppendChildRelations to separately track child
> tables whose Scan nodes' parent Append/MergeAppend would be removed by
> setrefs.c in some cases.
>
> So, this new patch uses a flag in the RangeTblEntry itself to denote
> if the table is a child table instead of the above roundabout way.
> ExecGetRangeTableRelation() can simply look at the RTE to decide
> whether to take a lock or not.  I considered adding a new bool field,
> but noticed we already have inFromCl to track if a given RTE is for
> table/entity directly mentioned in the query or for something added
> behind-the-scenes into the range table as the field's description in
> parsenodes.h says.  RTEs for child tables are added behind-the-scenes
> by the planner and it makes perfect sense to me to mark their inFromCl
> as false.  I can't find anything that relies on the current behavior
> of inFromCl being set to the same value as the root inheritance parent
> (true).  Patch 0002 makes this change for child RTEs.
>
> A few other notes:
>
> * A parallel worker does ExecutorStart() without access to the
> CachedPlan that the leader may have gotten its plan tree from.  This
> means that parallel workers do not have the ability to detect plan
> tree invalidations.  I think that's fine, because if the leader would
> have been able to launch workers at all, it would also have gotten all
> the locks to protect the (portion of) the plan tree that the workers
> would be executing.  I had an off-list discussion about this with
> Robert and he mentioned his concern that each parallel worker would
> have its own view of which child subplans of a parallel Append are
> "valid" that depends on the result of its own evaluation of initial
> pruning.   So, there may be race conditions whereby a worker may try
> to execute plan nodes that are no longer valid, for example, if the
> partition a worker considers valid is not viewed as such by the leader
> and thus not locked.  I shared my thoughts as to why that sounds
> unlikely at [1], though maybe I'm a bit too optimistic?
>
> * For multi-query portals, you can't now do ExecutorStart()
> immediately followed by ExecutorRun() for each query in the portal,
> because ExecutorStart() may now fail to start a plan if it gets
> invalidated.   So PortalStart() now does ExecutorStart()s for all
> queries and remembers the QueryDescs for PortalRun() then to do
> ExecutorRun()s using.  A consequence of this is that
> CommandCounterIncrement() now must be done between the
> ExecutorStart()s of the individual plans in PortalStart() and not
> between the ExecutorRun()s in PortalRunMulti().  make check-world
> passes with this new arrangement, though I'm not entirely confident
> that there are no problems lurking.

In an absolutely brown-paper-bag moment, I realized that I had not
updated src/backend/executor/README to reflect the changes to the
executor's control flow that this patch makes.   That is, after
scrapping the old design back in January whose details *were*
reflected in the patches before that redesign.

Anyway, the attached fixes that.

Tom, do you think you have bandwidth in the near future to give this
another look?  I think I've addressed the comments that you had given
back in April, though as mentioned in the previous message, there may
still be some funny-looking aspects remaining.  In any case, I
have no intention of pressing ahead with the patch without another
committer having had a chance to sign off on it.


--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Thom Brown
Date:
On Thu, 13 Jul 2023 at 13:59, Amit Langote <amitlangote09@gmail.com> wrote:
> In an absolutely brown-paper-bag moment, I realized that I had not
> updated src/backend/executor/README to reflect the changes to the
> executor's control flow that this patch makes.   That is, after
> scrapping the old design back in January whose details *were*
> reflected in the patches before that redesign.
>
> Anyway, the attached fixes that.
>
> Tom, do you think you have bandwidth in the near future to give this
> another look?  I think I've addressed the comments that you had given
> back in April, though as mentioned in the previous message, there may
> still be some funny-looking aspects still remaining.  In any case, I
> have no intention of pressing ahead with the patch without another
> committer having had a chance to sign off on it.

I've only just started taking a look at this, and my first test drive
yields very impressive results:

8192 partitions (3 runs, 10000 rows)
Head 391.294989 382.622481 379.252236
Patched 13088.145995 13406.135531 13431.828051

Looking at your changes to README, I would like to suggest rewording
the following:

+table during planning.  This means that inheritance child tables, which are
+added to the query's range table during planning, if they are present in a
+cached plan tree would not have been locked.

To:

This means that inheritance child tables present in a cached plan
tree, which are added to the query's range table during planning,
would not have been locked.

Also, further down:

s/intiatialize/initialize/

I'll carry on taking a closer look and see if I can break it.

Thom



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
Hi Thom,

On Tue, Jul 18, 2023 at 1:33 AM Thom Brown <thom@linux.com> wrote:
> On Thu, 13 Jul 2023 at 13:59, Amit Langote <amitlangote09@gmail.com> wrote:
> > In an absolutely brown-paper-bag moment, I realized that I had not
> > updated src/backend/executor/README to reflect the changes to the
> > executor's control flow that this patch makes.   That is, after
> > scrapping the old design back in January whose details *were*
> > reflected in the patches before that redesign.
> >
> > Anyway, the attached fixes that.
> >
> > Tom, do you think you have bandwidth in the near future to give this
> > another look?  I think I've addressed the comments that you had given
> > back in April, though as mentioned in the previous message, there may
> > still be some funny-looking aspects still remaining.  In any case, I
> > have no intention of pressing ahead with the patch without another
> > committer having had a chance to sign off on it.
>
> I've only just started taking a look at this, and my first test drive
> yields very impressive results:
>
> 8192 partitions (3 runs, 10000 rows)
> Head 391.294989 382.622481 379.252236
> Patched 13088.145995 13406.135531 13431.828051

Just to be sure, did you use pgbench -Mprepared with plan_cache_mode
= force_generic_plan in postgresql.conf?

> Looking at your changes to README, I would like to suggest rewording
> the following:
>
> +table during planning.  This means that inheritance child tables, which are
> +added to the query's range table during planning, if they are present in a
> +cached plan tree would not have been locked.
>
> To:
>
> This means that inheritance child tables present in a cached plan
> tree, which are added to the query's range table during planning,
> would not have been locked.
>
> Also, further down:
>
> s/intiatialize/initialize/
>
> I'll carry on taking a closer look and see if I can break it.

Thanks for looking.  I've fixed these issues in the attached updated
patch.  I've also changed the position of a newly added paragraph in
src/backend/executor/README so that it doesn't break the flow of the
existing text.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Thom Brown
Date:
On Tue, 18 Jul 2023, 08:26 Amit Langote, <amitlangote09@gmail.com> wrote:
Hi Thom,

On Tue, Jul 18, 2023 at 1:33 AM Thom Brown <thom@linux.com> wrote:
> On Thu, 13 Jul 2023 at 13:59, Amit Langote <amitlangote09@gmail.com> wrote:
> > In an absolutely brown-paper-bag moment, I realized that I had not
> > updated src/backend/executor/README to reflect the changes to the
> > executor's control flow that this patch makes.   That is, after
> > scrapping the old design back in January whose details *were*
> > reflected in the patches before that redesign.
> >
> > Anyway, the attached fixes that.
> >
> > Tom, do you think you have bandwidth in the near future to give this
> > another look?  I think I've addressed the comments that you had given
> > back in April, though as mentioned in the previous message, there may
> > still be some funny-looking aspects still remaining.  In any case, I
> > have no intention of pressing ahead with the patch without another
> > committer having had a chance to sign off on it.
>
> I've only just started taking a look at this, and my first test drive
> yields very impressive results:
>
> 8192 partitions (3 runs, 10000 rows)
> Head 391.294989 382.622481 379.252236
> Patched 13088.145995 13406.135531 13431.828051

Just to be sure, did you use pgbench -Mprepared with plan_cache_mode
= force_generic_plan in postgresql.conf?

I did.

For full disclosure, I also had max_locks_per_transaction set to 10000.

> Looking at your changes to README, I would like to suggest rewording
> the following:
>
> +table during planning.  This means that inheritance child tables, which are
> +added to the query's range table during planning, if they are present in a
> +cached plan tree would not have been locked.
>
> To:
>
> This means that inheritance child tables present in a cached plan
> tree, which are added to the query's range table during planning,
> would not have been locked.
>
> Also, further down:
>
> s/intiatialize/initialize/
>
> I'll carry on taking a closer look and see if I can break it.

Thanks for looking.  I've fixed these issues in the attached updated
patch.  I've also changed the position of a newly added paragraph in
src/backend/executor/README so that it doesn't break the flow of the
existing text.

Thanks.

Thom

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
While chatting with Robert about this patch set, he suggested that it
would be better to break out some executor refactoring changes from
the main patch (0003) into a separate patch.  To wit, the changes to
make the PlanState tree cleanup in ExecEndPlan() non-recursive by
walking a flat list of PlanState nodes instead of the recursive tree
walk that ExecEndNode() currently does.  That allows us to cleanly
handle the cases where the PlanState tree is only partially
constructed when ExecInitNode() detects in the middle of its
construction that the plan tree is no longer valid after receiving and
processing an invalidation message on locking child tables.  Or at
least more cleanly than the previously proposed approach of adjusting
ExecEndNode() subroutines for the individual node types to gracefully
handle such partially initialized PlanState trees.

With the new approach, the node-type-specific subroutines of ExecEndNode()
need not close their child nodes, because ExecEndPlan() directly closes
each node that was initialized.
instance of breakage by this decoupling of child node cleanup from
their parent node's cleanup.  Comments in ExecEndGather() and
ExecEndGatherMerge() appear to suggest that outerPlan must be closed
before the local cleanup:

 void
 ExecEndGather(GatherState *node)
 {
-   ExecEndNode(outerPlanState(node));  /* let children clean up first */
+   /* outerPlan is closed separately. */
    ExecShutdownGather(node);
    ExecFreeExprContext(&node->ps);

But I don't think there's a problem, because what ExecShutdownGather()
does seems entirely independent of cleanup of outerPlan.

As for the performance impact of building the list of initialized
nodes that is used during the cleanup phase, I couldn't find a regression,
nor any improvement from replacing the tree walk with a linear scan of a
list.  Actually, ExecEndNode() is pretty far down in the perf profile
anyway, so the performance difference caused by the patch hardly
matters.  See the following contrived example:

create table f();
analyze f;
explain (costs off) select count(*) from f f1, f f2, f f3, f f4, f f5,
f f6, f f7, f f8, f f9, f f10;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate
   ->  Nested Loop
         ->  Nested Loop
               ->  Nested Loop
                     ->  Nested Loop
                           ->  Nested Loop
                                 ->  Nested Loop
                                       ->  Nested Loop
                                             ->  Nested Loop
                                                   ->  Nested Loop
                                                         ->  Seq Scan on f f1
                                                         ->  Seq Scan on f f2
                                                   ->  Seq Scan on f f3
                                             ->  Seq Scan on f f4
                                       ->  Seq Scan on f f5
                                 ->  Seq Scan on f f6
                           ->  Seq Scan on f f7
                     ->  Seq Scan on f f8
               ->  Seq Scan on f f9
         ->  Seq Scan on f f10
(20 rows)

do $$
begin
for i in 1..100000 loop
perform count(*) from f f1, f f2, f f3, f f4, f f5, f f6, f f7, f f8,
f f9, f f10;
end loop;
end; $$;

Times for the DO:

Unpatched:
Time: 756.353 ms
Time: 745.752 ms
Time: 749.184 ms

Patched:
Time: 737.717 ms
Time: 747.815 ms
Time: 753.456 ms

I've attached the new refactoring patch as 0001.

Another change I've made in the main patch is to change the API of
ExecutorStart() (and ExecutorStart_hook) more explicitly to return a
boolean indicating whether or not the plan initialization was
successful.  That way seems better than making the callers figure that
out by seeing that QueryDesc.planstate is NULL and/or checking
QueryDesc.plan_valid.  Correspondingly, PortalStart() now also returns
true or false matching what ExecutorStart() returned.  I suppose this
better alerts any extensions that use the ExecutorStart_hook to fix
their code to do the right thing.
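
For illustration, a call site that executes a CachedPlan would use that return value in the replan-and-retry loop mentioned upthread, shaped roughly like this (a sketch only; build_querydesc_for() is a made-up stand-in for the unchanged CreateQueryDesc() boilerplate, and the resource-owner arguments are elided as NULL):

for (;;)
{
    CachedPlan  *cplan = GetCachedPlan(plansource, params, NULL, queryEnv);
    PlannedStmt *pstmt = linitial_node(PlannedStmt, cplan->stmt_list);
    QueryDesc   *qdesc = build_querydesc_for(pstmt, params);

    if (ExecutorStart(qdesc, 0))
        break;              /* locks taken and plan still valid; run qdesc */

    /* the CachedPlan got invalidated while locking child tables; replan */
    FreeQueryDesc(qdesc);
    ReleaseCachedPlan(cplan, NULL);
}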

Having extracted the ExecEndNode() change, I'm also starting to feel
inclined to extract a couple of other bits from the main patch as
separate patches, such as moving the ExecutorStart() call from
PortalRun() to PortalStart() for the multi-query portals.  I'll do
that in the next version.

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Wed, Aug 2, 2023 at 10:39 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Having extracted the ExecEndNode() change, I'm also starting to feel
> inclined to extract a couple of other bits from the main patch as
> separate patches, such as moving the ExecutorStart() call from
> PortalRun() to PortalStart() for the multi-query portals.  I'll do
> that in the next version.

Here's a patch set where the refactoring to move the ExecutorStart()
calls to be closer to GetCachedPlan() (for the call sites that use a
CachedPlan) is extracted into a separate patch, 0002.  Its commit
message notes an aspect of this refactoring that I feel a bit nervous
about -- needing to also move the CommandCounterIncrement() call from
the loop in PortalRunMulti() to PortalStart() which now does
ExecutorStart() for the PORTAL_MULTI_QUERY case.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Thu, Aug 3, 2023 at 4:37 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Here's a patch set where the refactoring to move the ExecutorStart()
> calls to be closer to GetCachedPlan() (for the call sites that use a
> CachedPlan) is extracted into a separate patch, 0002.  Its commit
> message notes an aspect of this refactoring that I feel a bit nervous
> about -- needing to also move the CommandCounterIncrement() call from
> the loop in PortalRunMulti() to PortalStart() which now does
> ExecutorStart() for the PORTAL_MULTI_QUERY case.

I spent some time today reviewing 0001. Here are a few thoughts and
notes about things that I looked at.

First, I wondered whether it was really adequate for ExecEndPlan() to
just loop over estate->es_plan_nodes and call it good. Put
differently, is it possible that we could ever have more than one
relevant EState, say for a subplan or an EPQ execution or something,
so that this loop wouldn't cover everything? I found nothing to make
me think that this is a real danger.

Second, I wondered whether the ordering of cleanup operations could be
an issue. Right now, a node can position cleanup code before, after,
or both before and after recursing to child nodes, whereas with this
design change, the cleanup code will always be run before recursing to
child nodes. Here, I think we have problems. Both ExecEndGather and
ExecEndGatherMerge intentionally clean up the children before the
parent, so that the child shutdown happens before
ExecParallelCleanup(). Based on the comment and commit
acf555bc53acb589b5a2827e65d655fa8c9adee0, this appears to be
intentional, and you can sort of see why from looking at the stuff
that happens in ExecParallelCleanup(). If the instrumentation data
vanishes before the child nodes have a chance to clean things up,
maybe EXPLAIN ANALYZE won't reflect that instrumentation any more. If
the DSA vanishes, maybe we'll crash if we try to access it. If we
actually reach DestroyParallelContext(), we're just going to start
killing the workers. None of that sounds like what we want.

The good news, of a sort, is that I think this might be the only case
of this sort of problem. Most nodes recurse at the end, after doing
all the cleanup, so the behavior won't change. Moreover, even if it
did, most cleanup operations look pretty localized -- they affect only
the node itself, and not its children. A somewhat interesting case is
nodes associated with subplans. Right now, because of the coding of
ExecEndPlan, nodes associated with subplans are all cleaned up at the
very end, after everything that's not inside of a subplan. But with
this change, they'd get cleaned up in the order of initialization,
which actually seems more natural, as long as it doesn't break
anything, which I think it probably won't, since as I mention in most
cases node cleanup looks quite localized, i.e. it doesn't care whether
it happens before or after the cleanup of other nodes.

I think something will have to be done about the parallel query stuff,
though. I'm not sure exactly what. It is a little weird that Gather
and Gather Merge treat starting and killing workers as a purely
"private matter" that they can decide to handle without the executor
overall being very much aware of it. So maybe there's a way that some
of the cleanup logic here could be hoisted up into the general
executor machinery, that is, first end all the nodes, and then go
back, and end all the parallelism using, maybe, another list inside of
the estate. However, I think that the existence of ExecShutdownNode()
is a complication here -- we need to make sure that we don't break
either the case where that happens before overall plan shutdown, or the
case where it doesn't.

Third, a couple of minor comments on details of how you actually made
these changes in the patch set. Personally, I would remove all of the
"is closed separately" comments that you added. I think it's a
violation of the general coding principle that you should make the
code look like it's always been that way. Sure, in the immediate
future, people might wonder why you don't need to recurse, but 5 or 10
years from now that's just going to be clutter. Second, in the cases
where the ExecEndNode functions end up completely empty, I would
suggest just removing the functions entirely and making the switch
that dispatches on the node type have a switch case that lists all the
nodes that don't need a callback here and say /* Nothing to do for these
node types */ break;. This will save a few CPU cycles and I think it
will be easier to read as well.
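
Something like this shape, presumably (which node types actually end up with nothing to do depends on the rest of the patch; the ones named here are only placeholders):

switch (nodeTag(node))
{
        /* Nothing to do for these node types */
    case T_ResultState:
    case T_LimitState:
    case T_ProjectSetState:
        break;

        /* Node types that still need real cleanup keep their callback */
    case T_GatherState:
        ExecEndGather((GatherState *) node);
        break;

        /* ... and so on for the remaining node types ... */

    default:
        elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
        break;
}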

Fourth, I wonder whether we really need this patch at all. I initially
thought we did, because if we abandon the initialization of a plan
partway through, then we end up with a plan that is in a state that
previously would never have occurred, and we still have to be able to
clean it up. However, perhaps it's a difference without a distinction.
Say we have a partial plan tree, where not all of the PlanState nodes
ever got created. We then just call the existing version of
ExecEndPlan() on it, with no changes. What goes wrong? Sure, we might
call ExecEndNode() on some null pointers where in the current world
there would always be valid pointers, but ExecEndNode() will handle
that just fine, by doing nothing for those nodes, because it starts
with a NULL-check.

Another alternative design might be to switch ExecEndNode to use
planstate_tree_walker to walk the node tree, removing the walk from
the node-type-specific functions as in this patch, and deleting the
end-node functions that are no longer required altogether, as proposed
above. I somehow feel that this would be cleaner than the status quo,
but here again, I'm not sure we really need it. planstate_tree_walker
would just pass over any NULL pointers that it found without doing
anything, but the current code does that too, so while this might be
more beautiful than what we have now, I'm not sure that there's any
real reason to do it. The fact that, like the current patch, it would
change the order in which nodes are cleaned up is also an issue -- the
Gather/Gather Merge ordering issues might be easier to handle this way
with some hack in ExecEndNode() than they are with the design you have
now, but we'd still have to do something about them, I believe.
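
For concreteness, that alternative might look something like this (a sketch; ExecEndNodeLocal() is a made-up name for whatever node-local cleanup remains):

static bool
ExecEndNodeWalker(PlanState *node, void *context)
{
    if (node == NULL)
        return false;

    /* node-local cleanup first, as most ExecEnd* functions do today */
    ExecEndNodeLocal(node);

    /* then visit the children; NULL children are simply skipped */
    return planstate_tree_walker(node, ExecEndNodeWalker, context);
}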

Sorry if this is a bit of a meandering review, but those are my thoughts.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Second, I wondered whether the ordering of cleanup operations could be
> an issue. Right now, a node can position cleanup code before, after,
> or both before and after recursing to child nodes, whereas with this
> design change, the cleanup code will always be run before recursing to
> child nodes. Here, I think we have problems. Both ExecGather and
> ExecEndGatherMerge intentionally clean up the children before the
> parent, so that the child shutdown happens before
> ExecParallelCleanup(). Based on the comment and commit
> acf555bc53acb589b5a2827e65d655fa8c9adee0, this appears to be
> intentional, and you can sort of see why from looking at the stuff
> that happens in ExecParallelCleanup().

Right, I doubt that changing that is going to work out well.
Hash joins might have issues with it too.

Could it work to make the patch force child cleanup before parent,
instead of after?  Or would that break other places?

On the whole though I think it's probably a good idea to leave
parent nodes in control of the timing, so I kind of side with
your later comment about whether we want to change this at all.

            regards, tom lane



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Mon, Aug 7, 2023 at 11:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Right, I doubt that changing that is going to work out well.
> Hash joins might have issues with it too.

I thought about the case, because Hash and Hash Join are such closely
intertwined nodes, but I don't see any problem there. It doesn't
really look like it would matter in what order things got cleaned up.
Unless I'm missing something, all of the data structures are just
independent things that we have to get rid of sometime.

> Could it work to make the patch force child cleanup before parent,
> instead of after?  Or would that break other places?

To me, it seems like the overwhelming majority of the code simply
doesn't care. You could pick an order out of a hat and it would be
100% OK. But I haven't gone and looked through it with this specific
idea in mind.

> On the whole though I think it's probably a good idea to leave
> parent nodes in control of the timing, so I kind of side with
> your later comment about whether we want to change this at all.

My overall feeling here is that what Gather and Gather Merge are doing
is pretty weird. I think I kind of knew that at the time this was all
getting implemented and reviewed, but I wasn't keen to introduce more
infrastructure changes than necessary given that parallel query, as a
project, was still pretty new and I didn't want to give other hackers
more reasons to be unhappy with what was already a lot of very
wide-ranging change to the system. A good number of years having gone
by now, and other people having worked on that code some more, I'm not
too worried about someone calling for a wholesale revert of parallel
query. However, there's a second problem here as well, which is that
I'm still not sure what the right thing to do is. We've fiddled around
with the shutdown sequence for parallel query a number of times now,
and I think there's still stuff that doesn't work quite right,
especially around getting all of the instrumentation data back to the
leader. I haven't spent enough time on this recently enough to be sure
what if any problems remain, though.

So on the one hand, I don't really like the fact that we have an
ad-hoc recursion arrangement here, instead of using
planstate_tree_walker or, as Amit proposes, a List. Giving subordinate
nodes control over the ordering when they don't really need it just
means we have more code with more possibility for bugs and less
certainty about whether the theoretical flexibility is doing anything
in practice. But on the other hand, because we know that at least for
the Gather/GatherMerge case it seems like it probably matters
somewhat, it definitely seems appealing not to change anything as part
of this patch set that we don't really have to.

I've had it firmly in my mind here that we were going to need to
change something somehow -- I mean, the possibility of returning in
the middle of node initialization seems like a pretty major change to
the way this stuff works, and it seems hard for me to believe that we
can just do that and not have to adjust any code anywhere else. Can it
really be true that we can do that and yet not end up creating any
states anywhere with which the current cleanup code is unprepared to
cope? Maybe, but it would seem like rather good luck if that's how it
shakes out. Still, at the moment, I'm having a hard time understanding
what this particular change buys us.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Aug 8, 2023 at 12:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Aug 3, 2023 at 4:37 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Here's a patch set where the refactoring to move the ExecutorStart()
> > calls to be closer to GetCachedPlan() (for the call sites that use a
> > CachedPlan) is extracted into a separate patch, 0002.  Its commit
> > message notes an aspect of this refactoring that I feel a bit nervous
> > about -- needing to also move the CommandCounterIncrement() call from
> > the loop in PortalRunMulti() to PortalStart() which now does
> > ExecutorStart() for the PORTAL_MULTI_QUERY case.
>
> I spent some time today reviewing 0001. Here are a few thoughts and
> notes about things that I looked at.

Thanks for taking a look at this.

> First, I wondered whether it was really adequate for ExecEndPlan() to
> just loop over estate->es_plan_nodes and call it good. Put
> differently, is it possible that we could ever have more than one
> relevant EState, say for a subplan or an EPQ execution or something,
> so that this loop wouldn't cover everything? I found nothing to make
> me think that this is a real danger.

Check.

> Second, I wondered whether the ordering of cleanup operations could be
> an issue. Right now, a node can position cleanup code before, after,
> or both before and after recursing to child nodes, whereas with this
> design change, the cleanup code will always be run before recursing to
> child nodes.

Because a node is appended to es_planstate_nodes at the end of
ExecInitNode(), child nodes get added before their parent nodes.  So
the children are cleaned up first.

> Here, I think we have problems. Both ExecGather and
> ExecEndGatherMerge intentionally clean up the children before the
> parent, so that the child shutdown happens before
> ExecParallelCleanup(). Based on the comment and commit
> acf555bc53acb589b5a2827e65d655fa8c9adee0, this appears to be
> intentional, and you can sort of see why from looking at the stuff
> that happens in ExecParallelCleanup(). If the instrumentation data
> vanishes before the child nodes have a chance to clean things up,
> maybe EXPLAIN ANALYZE won't reflect that instrumentation any more. If
> the DSA vanishes, maybe we'll crash if we try to access it. If we
> actually reach DestroyParallelContext(), we're just going to start
> killing the workers. None of that sounds like what we want.
>
> The good news, of a sort, is that I think this might be the only case
> of this sort of problem. Most nodes recurse at the end, after doing
> all the cleanup, so the behavior won't change. Moreover, even if it
> did, most cleanup operations look pretty localized -- they affect only
> the node itself, and not its children. A somewhat interesting case is
> nodes associated with subplans. Right now, because of the coding of
> ExecEndPlan, nodes associated with subplans are all cleaned up at the
> very end, after everything that's not inside of a subplan. But with
> this change, they'd get cleaned up in the order of initialization,
> which actually seems more natural, as long as it doesn't break
> anything, which I think it probably won't, since as I mention in most
> cases node cleanup looks quite localized, i.e. it doesn't care whether
> it happens before or after the cleanup of other nodes.
>
> I think something will have to be done about the parallel query stuff,
> though. I'm not sure exactly what. It is a little weird that Gather
> and Gather Merge treat starting and killing workers as a purely
> "private matter" that they can decide to handle without the executor
> overall being very much aware of it. So maybe there's a way that some
> of the cleanup logic here could be hoisted up into the general
> executor machinery, that is, first end all the nodes, and then go
> back, and end all the parallelism using, maybe, another list inside of
> the estate. However, I think that the existence of ExecShutdownNode()
> is a complication here -- we need to make sure that we don't break
> either the case where that happen before overall plan shutdown, or the
> case where it doesn't.

Given that children are closed before parent, the order of operations
in ExecEndGather[Merge] is unchanged.

> Third, a couple of minor comments on details of how you actually made
> these changes in the patch set. Personally, I would remove all of the
> "is closed separately" comments that you added. I think it's a
> violation of the general coding principle that you should make the
> code look like it's always been that way. Sure, in the immediate
> future, people might wonder why you don't need to recurse, but 5 or 10
> years from now that's just going to be clutter. Second, in the cases
> where the ExecEndNode functions end up completely empty, I would
> suggest just removing the functions entirely and making the switch
> that dispatches on the node type have a switch case that lists all the
> nodes that don't need a callback here and say /* Nothing do for these
> node types */ break;. This will save a few CPU cycles and I think it
> will be easier to read as well.

I agree with both suggestions.

> Fourth, I wonder whether we really need this patch at all. I initially
> thought we did, because if we abandon the initialization of a plan
> partway through, then we end up with a plan that is in a state that
> previously would never have occurred, and we still have to be able to
> clean it up. However, perhaps it's a difference without a distinction.
> Say we have a partial plan tree, where not all of the PlanState nodes
> ever got created. We then just call the existing version of
> ExecEndPlan() on it, with no changes. What goes wrong? Sure, we might
> call ExecEndNode() on some null pointers where in the current world
> there would always be valid pointers, but ExecEndNode() will handle
> that just fine, by doing nothing for those nodes, because it starts
> with a NULL-check.

Well, not all cleanup actions for a given node type are a recursive
call to ExecEndNode(); some are things like this:

    /*
     * clean out the tuple table
     */
    ExecClearTuple(node->ps.ps_ResultTupleSlot);

But should ExecInitNode() subroutines return the partially initialized
PlanState node or NULL on detecting invalidation?  If I'm
understanding correctly how you think this should work, I think
you mean the former, because if it were the latter, ExecInitNode()
would end up returning NULL at the top for the root and then there's
nothing to pass to ExecEndNode(), so no way to clean up to begin with.
In that case, I think we will need to adjust ExecEndNode() subroutines
to add `if (node->ps.ps_ResultTupleSlot)` in the above code, for
example.  That's something Tom had said he doesn't like very much [1].

Some node types such as Append, BitmapAnd, etc. that contain a list of
subplans would need some adjustment, such as using palloc0 for
as_appendplans[], etc. so that uninitialized subplans have NULL in the
array.

There are also issues around ForeignScan, CustomScan
ExecEndNode()-time callbacks when they are partially initialized -- is
it OK to call the *EndScan callback if the *BeginScan one may not have
been called to begin with?  Though, perhaps we can adjust the
ExecInitNode() subroutines for those to return NULL by opening the
relation and checking for invalidation at the beginning instead of in
the middle.  That should be done for all Scan or leaf-level node
types.

Anyway, I guess, for the patch's purpose, maybe we should bite the
bullet and make those adjustments rather than change ExecEndNode() as
proposed.  I can give that another try.

> Another alternative design might be to switch ExecEndNode to use
> planstate_tree_walker to walk the node tree, removing the walk from
> the node-type-specific functions as in this patch, and deleting the
> end-node functions that are no longer required altogether, as proposed
> above. I somehow feel that this would be cleaner than the status quo,
> but here again, I'm not sure we really need it. planstate_tree_walker
> would just pass over any NULL pointers that it found without doing
> anything, but the current code does that too, so while this might be
> more beautiful than what we have now, I'm not sure that there's any
> real reason to do it. The fact that, like the current patch, it would
> change the order in which nodes are cleaned up is also an issue -- the
> Gather/Gather Merge ordering issues might be easier to handle this way
> with some hack in ExecEndNode() than they are with the design you have
> now, but we'd still have to do something about them, I believe.

It might be interesting to see if introducing planstate_tree_walker()
in ExecEndNode() makes it easier to reason about ExecEndNode()
generally speaking, but I think you may be right that doing so may not
really make matters easier for the partially initialized planstate
tree case.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Tue, Aug 8, 2023 at 10:32 AM Amit Langote <amitlangote09@gmail.com> wrote:
> But should ExecInitNode() subroutines return the partially initialized
> PlanState node or NULL on detecting invalidation?  If I'm
> understanding how you think this should be working correctly, I think
> you mean the former, because if it were the latter, ExecInitNode()
> would end up returning NULL at the top for the root and then there's
> nothing to pass to ExecEndNode(), so no way to clean up to begin with.
> In that case, I think we will need to adjust ExecEndNode() subroutines
> to add `if (node->ps.ps_ResultTupleSlot)` in the above code, for
> example.  That's something Tom had said he doesn't like very much [1].

Yeah, I understood Tom's goal as being "don't return partially
initialized nodes."

Personally, I'm not sure that's an important goal. In fact, I don't
even think it's a desirable one. It doesn't look difficult to audit
the end-node functions for cases where they'd fail if a particular
pointer were NULL instead of pointing to some real data, and just
fixing all such cases to have NULL-tests looks like purely mechanical
work that we are unlikely to get wrong. And at least some cases
wouldn't require any changes at all.

If we don't do that, the complexity doesn't go away. It just moves
someplace else. Presumably what we do in that case is have
ExecInitNode functions undo any initialization that they've already
done before returning NULL. There are basically two ways to do that.
Option one is to add code at the point where they return early to
clean up anything they've already initialized, but that code is likely
to substantially duplicate whatever the ExecEndNode function already
knows how to do, and it's very easy for logic like this to get broken
if somebody rearranges an ExecInitNode function down the road. Option
two is to rearrange the ExecInitNode functions now, to open relations
or recurse at the beginning, so that we discover the need to fail
before we initialize anything. That restricts our ability to further
rearrange the functions in future somewhat, but more importantly,
IMHO, it introduces more risk right now. Checking that the ExecEndNode
function will not fail if some pointers are randomly null is a lot
easier than checking that changing the order of operations in an
ExecInitNode function breaks nothing.

I'm not here to say that we can't do one of those things. But I think
adding null-tests to ExecEndNode functions looks like *far* less work
and *way* less risk.

There's a second issue here, too, which is when we abort ExecInitNode
partway through, how do we signal that? You're rightly pointing out
here that if we do that by returning NULL, then we don't do it by
returning a pointer to the partially initialized node that we just
created, which means that we either need to store those partially
initialized nodes in a separate data structure as you propose to do in
0001, or else we need to pick a different signalling convention. We
could change (a) ExecInitNode to have an additional argument, bool
*kaboom, or (b) we could make it return bool and return the node
pointer via a new additional argument, or (c) we could put a Boolean
flag into the estate and let the function signal failure by flipping
the value of the flag. If we do any of those things, then as far as I
can see 0001 is unnecessary. If we do none of them but also avoid
creating partially initialized nodes by one of the two techniques
mentioned two paragraphs prior, then 0001 is also unnecessary. If we
do none of them but do create partially initialized nodes, then we
need 0001.

So if this were a restaurant menu, then it might look like this:

Prix Fixe Menu (choose one from each)

First Course - How do we clean up after partial initialization?
(1) ExecInitNode functions produce partially initialized nodes
(2) ExecInitNode functions get refactored so that the stuff that can
cause early exit always happens first, so that no cleanup is ever
needed
(3) ExecInitNode functions do any required cleanup in situ

Second Course - How do we signal that initialization stopped early?
(A) Return NULL.
(B) Add a bool * out-parameter to ExecInitNode.
(C) Add a Node * out-parameter to ExecInitNode and change the return
value to bool.
(D) Add a bool to the EState.
(E) Something else, maybe.

I think that we need 0001 if we choose specifically (1) and (A). My
gut feeling is that the least-invasive way to do this project is to
choose (1) and (D). My second choice would be (1) and (C), and my
third choice would be (1) and (A). If I can't have (1), I think I
prefer (2) over (3), but I also believe I prefer hiding in a deep hole
to either of them. Maybe I'm not seeing the whole picture correctly
here, but both (2) and (3) look awfully painful to me.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Wed, Aug 9, 2023 at 1:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 8, 2023 at 10:32 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > But should ExecInitNode() subroutines return the partially initialized
> > PlanState node or NULL on detecting invalidation?  If I'm
> > understanding how you think this should be working correctly, I think
> > you mean the former, because if it were the latter, ExecInitNode()
> > would end up returning NULL at the top for the root and then there's
> > nothing to pass to ExecEndNode(), so no way to clean up to begin with.
> > In that case, I think we will need to adjust ExecEndNode() subroutines
> > to add `if (node->ps.ps_ResultTupleSlot)` in the above code, for
> > example.  That's something Tom had said he doesn't like very much [1].
>
> Yeah, I understood Tom's goal as being "don't return partially
> initialized nodes."
>
> Personally, I'm not sure that's an important goal. In fact, I don't
> even think it's a desirable one. It doesn't look difficult to audit
> the end-node functions for cases where they'd fail if a particular
> pointer were NULL instead of pointing to some real data, and just
> fixing all such cases to have NULL-tests looks like purely mechanical
> work that we are unlikely to get wrong. And at least some cases
> wouldn't require any changes at all.
>
> If we don't do that, the complexity doesn't go away. It just moves
> someplace else. Presumably what we do in that case is have
> ExecInitNode functions undo any initialization that they've already
> done before returning NULL. There are basically two ways to do that.
> Option one is to add code at the point where they return early to
> clean up anything they've already initialized, but that code is likely
> to substantially duplicate whatever the ExecEndNode function already
> knows how to do, and it's very easy for logic like this to get broken
> if somebody rearranges an ExecInitNode function down the road.

Yeah, I too am not a fan of making ExecInitNode() clean up partially
initialized nodes.

> Option
> two is to rearrange the ExecInitNode functions now, to open relations
> or recurse at the beginning, so that we discover the need to fail
> before we initialize anything. That restricts our ability to further
> rearrange the functions in future somewhat, but more importantly,
> IMHO, it introduces more risk right now. Checking that the ExecEndNode
> function will not fail if some pointers are randomly null is a lot
> easier than checking that changing the order of operations in an
> ExecInitNode function breaks nothing.
>
> I'm not here to say that we can't do one of those things. But I think
> adding null-tests to ExecEndNode functions looks like *far* less work
> and *way* less risk.

+1

> There's a second issue here, too, which is when we abort ExecInitNode
> partway through, how do we signal that? You're rightly pointing out
> here that if we do that by returning NULL, then we don't do it by
> returning a pointer to the partially initialized node that we just
> created, which means that we either need to store those partially
> initialized nodes in a separate data structure as you propose to do in
> 0001,
>
> or else we need to pick a different signalling convention. We
> could change (a) ExecInitNode to have an additional argument, bool
> *kaboom, or (b) we could make it return bool and return the node
> pointer via a new additional argument, or (c) we could put a Boolean
> flag into the estate and let the function signal failure by flipping
> the value of the flag.

The failure can already be detected by seeing that
ExecPlanIsValid(estate) is false.  The question is what ExecInitNode()
or any of its subroutines should return once it is.  I think the
following convention works:

Return partially initialized state from ExecInit* function where we
detect the invalidation after calling ExecInitNode() on a child plan,
so that ExecEndNode() can recurse to clean it up.

Return NULL from ExecInit* functions where we detect the invalidation
after opening and locking a relation but before calling ExecInitNode()
to initialize a child plan if there's one at all.  Even if we may set
things like ExprContext, TupleTableSlot fields, they are cleaned up
independently of the plan tree anyway via the cleanup called with
es_exprcontexts, es_tupleTable, respectively.  I even noticed bits
like this in ExecEnd* functions:

-   /*
-    * Free the exprcontext(s) ... now dead code, see ExecFreeExprContext
-    */
-#ifdef NOT_USED
-   ExecFreeExprContext(&node->ss.ps);
-   if (node->ioss_RuntimeContext)
-       FreeExprContext(node->ioss_RuntimeContext, true);
-#endif

So, AFAICS, the ExprContext and TupleTableSlot cleanup in ExecEnd*
functions is unnecessary but remains around because nobody has gotten
around to getting rid of it.
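
Sketched with a made-up node type that both opens a relation and has an outer child (FooScan/FooScanState are hypothetical, and the validity check is spelled ExecPlanStillValid() to match the attached patch), the convention looks like this:

static FooScanState *
ExecInitFooScan(FooScan *node, EState *estate, int eflags)
{
    FooScanState *fss;
    Relation    rel;

    /* Opening the relation takes the lock and may process invalidations. */
    rel = ExecGetRangeTableRelation(estate, node->scan.scanrelid);
    if (!ExecPlanStillValid(estate))
        return NULL;        /* nothing initialized yet worth cleaning up */

    fss = makeNode(FooScanState);
    fss->ss.ss_currentRelation = rel;
    /* ... set up slots, expression contexts, etc. ... */

    outerPlanState(fss) = ExecInitNode(outerPlan(node), estate, eflags);
    if (!ExecPlanStillValid(estate))
        return fss;         /* partially initialized; ExecEndNode() can
                             * recurse into the children that do exist */

    /* ... remaining initialization ... */
    return fss;
}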

> If we do any of those things, then as far as I
> can see 0001 is unnecessary. If we do none of them but also avoid
> creating partially initialized nodes by one of the two techniques
> mentioned two paragraphs prior, then 0001 is also unnecessary. If we
> do none of them but do create partially initialized nodes, then we
> need 0001.
>
> So if this were a restaurant menu, then it might look like this:
>
> Prix Fixe Menu (choose one from each)
>
> First Course - How do we clean up after partial initialization?
> (1) ExecInitNode functions produce partially initialized nodes
> (2) ExecInitNode functions get refactored so that the stuff that can
> cause early exit always happens first, so that no cleanup is ever
> needed
> (3) ExecInitNode functions do any required cleanup in situ
>
> Second Course - How do we signal that initialization stopped early?
> (A) Return NULL.
> (B) Add a bool * out-parmeter to ExecInitNode.
> (C) Add a Node * out-parameter to ExecInitNode and change the return
> value to bool.
> (D) Add a bool to the EState.
> (E) Something else, maybe.
>
> I think that we need 0001 if we choose specifically (1) and (A). My
> gut feeling is that the least-invasive way to do this project is to
> choose (1) and (D). My second choice would be (1) and (C), and my
> third choice would be (1) and (A). If I can't have (1), I think I
> prefer (2) over (3), but I also believe I prefer hiding in a deep hole
> to either of them. Maybe I'm not seeing the whole picture correctly
> here, but both (2) and (3) look awfully painful to me.

I think what I've ended up with in the attached 0001 (WIP) is a
combination of (1), (2), and (D).  As mentioned above, (D) is implemented with the
ExecPlanStillValid() function.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Aug 11, 2023 at 14:31 Amit Langote <amitlangote09@gmail.com> wrote:
On Wed, Aug 9, 2023 at 1:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 8, 2023 at 10:32 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > But should ExecInitNode() subroutines return the partially initialized
> > PlanState node or NULL on detecting invalidation?  If I'm
> > understanding how you think this should be working correctly, I think
> > you mean the former, because if it were the latter, ExecInitNode()
> > would end up returning NULL at the top for the root and then there's
> > nothing to pass to ExecEndNode(), so no way to clean up to begin with.
> > In that case, I think we will need to adjust ExecEndNode() subroutines
> > to add `if (node->ps.ps_ResultTupleSlot)` in the above code, for
> > example.  That's something Tom had said he doesn't like very much [1].
>
> Yeah, I understood Tom's goal as being "don't return partially
> initialized nodes."
>
> Personally, I'm not sure that's an important goal. In fact, I don't
> even think it's a desirable one. It doesn't look difficult to audit
> the end-node functions for cases where they'd fail if a particular
> pointer were NULL instead of pointing to some real data, and just
> fixing all such cases to have NULL-tests looks like purely mechanical
> work that we are unlikely to get wrong. And at least some cases
> wouldn't require any changes at all.
>
> If we don't do that, the complexity doesn't go away. It just moves
> someplace else. Presumably what we do in that case is have
> ExecInitNode functions undo any initialization that they've already
> done before returning NULL. There are basically two ways to do that.
> Option one is to add code at the point where they return early to
> clean up anything they've already initialized, but that code is likely
> to substantially duplicate whatever the ExecEndNode function already
> knows how to do, and it's very easy for logic like this to get broken
> if somebody rearranges an ExecInitNode function down the road.

Yeah, I too am not a fan of making ExecInitNode() clean up partially
initialized nodes.

> Option
> two is to rearrange the ExecInitNode functions now, to open relations
> or recurse at the beginning, so that we discover the need to fail
> before we initialize anything. That restricts our ability to further
> rearrange the functions in future somewhat, but more importantly,
> IMHO, it introduces more risk right now. Checking that the ExecEndNode
> function will not fail if some pointers are randomly null is a lot
> easier than checking that changing the order of operations in an
> ExecInitNode function breaks nothing.
>
> I'm not here to say that we can't do one of those things. But I think
> adding null-tests to ExecEndNode functions looks like *far* less work
> and *way* less risk.

+1

> There's a second issue here, too, which is when we abort ExecInitNode
> partway through, how do we signal that? You're rightly pointing out
> here that if we do that by returning NULL, then we don't do it by
> returning a pointer to the partially initialized node that we just
> created, which means that we either need to store those partially
> initialized nodes in a separate data structure as you propose to do in
> 0001,
>
> or else we need to pick a different signalling convention. We
> could change (a) ExecInitNode to have an additional argument, bool
> *kaboom, or (b) we could make it return bool and return the node
> pointer via a new additional argument, or (c) we could put a Boolean
> flag into the estate and let the function signal failure by flipping
> the value of the flag.

The failure can already be detected by seeing that
ExecPlanIsValid(estate) is false.  The question is what ExecInitNode()
or any of its subroutines should return once it is.  I think the
following convention works:

Return partially initialized state from ExecInit* function where we
detect the invalidation after calling ExecInitNode() on a child plan,
so that ExecEndNode() can recurse to clean it up.

Return NULL from ExecInit* functions where we detect the invalidation
after opening and locking a relation but before calling ExecInitNode()
to initialize a child plan if there's one at all.  Even if we may set
things like ExprContext, TupleTableSlot fields, they are cleaned up
independently of the plan tree anyway via the cleanup called with
es_exprcontexts, es_tupleTable, respectively.  I even noticed bits
like this in ExecEnd* functions:

-   /*
-    * Free the exprcontext(s) ... now dead code, see ExecFreeExprContext
-    */
-#ifdef NOT_USED
-   ExecFreeExprContext(&node->ss.ps);
-   if (node->ioss_RuntimeContext)
-       FreeExprContext(node->ioss_RuntimeContext, true);
-#endif

So, AFAICS, the ExprContext and TupleTableSlot cleanup in ExecEnd*
functions is unnecessary but remains around because nobody has cared
enough to get around to removing it.

> If we do any of those things, then as far as I
> can see 0001 is unnecessary. If we do none of them but also avoid
> creating partially initialized nodes by one of the two techniques
> mentioned two paragraphs prior, then 0001 is also unnecessary. If we
> do none of them but do create partially initialized nodes, then we
> need 0001.
>
> So if this were a restaurant menu, then it might look like this:
>
> Prix Fixe Menu (choose one from each)
>
> First Course - How do we clean up after partial initialization?
> (1) ExecInitNode functions produce partially initialized nodes
> (2) ExecInitNode functions get refactored so that the stuff that can
> cause early exit always happens first, so that no cleanup is ever
> needed
> (3) ExecInitNode functions do any required cleanup in situ
>
> Second Course - How do we signal that initialization stopped early?
> (A) Return NULL.
> (B) Add a bool * out-parameter to ExecInitNode.
> (C) Add a Node * out-parameter to ExecInitNode and change the return
> value to bool.
> (D) Add a bool to the EState.
> (E) Something else, maybe.
>
> I think that we need 0001 if we choose specifically (1) and (A). My
> gut feeling is that the least-invasive way to do this project is to
> choose (1) and (D). My second choice would be (1) and (C), and my
> third choice would be (1) and (A). If I can't have (1), I think I
> prefer (2) over (3), but I also believe I prefer hiding in a deep hole
> to either of them. Maybe I'm not seeing the whole picture correctly
> here, but both (2) and (3) look awfully painful to me.

I think what I've ended up with in the attached 0001 (WIP) is a
combination of (1), (2), and (D).  As mentioned above, (D) is
implemented with the ExecPlanStillValid() function.

After removing the unnecessary cleanup code from most node types’
ExecEnd* functions, one thing I’m tempted to do is remove the functions
that do nothing else but recurse to close the outerPlan and innerPlan
child nodes.  We could instead have ExecEndNode() itself recurse into
the outerPlan and innerPlan child nodes at the top, which preserves the
close-child-before-self behavior for Gather* nodes, and call
node-type-specific cleanup functions only for those nodes that do have
any local cleanup to do.  Perhaps we could even use
planstate_tree_walker() called at the top instead of the usual bottom,
so that nodes with a list of child subplans, like Append, also don’t
need to have their own ExecEnd* functions.
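
Roughly, I'm thinking of something like the following, where
ExecEndFoo() / FooState stand in for any node type that still needs
local cleanup (only a sketch of the shape, not the final code):

    void
    ExecEndNode(PlanState *node)
    {
        if (node == NULL)
            return;

        /*
         * Close the standard child nodes first, preserving the
         * close-child-before-self order that Gather* relies on.
         */
        ExecEndNode(outerPlanState(node));
        ExecEndNode(innerPlanState(node));

        switch (nodeTag(node))
        {
                /* only node types with local cleanup keep an ExecEnd* routine */
            case T_FooState:
                ExecEndFoo((FooState *) node);
                break;

            default:
                break;
        }
    }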
--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Fri, Aug 11, 2023 at 9:50 AM Amit Langote <amitlangote09@gmail.com> wrote:
> After removing the unnecessary cleanup code from most node types’ ExecEnd* functions, one thing I’m tempted to do is
> remove the functions that do nothing else but recurse to close the outerPlan and innerPlan child nodes.  We could instead
> have ExecEndNode() itself recurse into the outerPlan and innerPlan child nodes at the top, which preserves the
> close-child-before-self behavior for Gather* nodes, and call node-type-specific cleanup functions only for those nodes
> that do have any local cleanup to do.  Perhaps we could even use planstate_tree_walker() called at the top instead of the
> usual bottom, so that nodes with a list of child subplans, like Append, also don’t need to have their own ExecEnd* functions.

I think 0001 needs to be split up. Like, this is code cleanup:

-       /*
-        * Free the exprcontext
-        */
-       ExecFreeExprContext(&node->ss.ps);

This is providing for NULL pointers where we don't currently:

-       list_free_deep(aggstate->hash_batches);
+       if (aggstate->hash_batches)
+               list_free_deep(aggstate->hash_batches);

And this is the early return mechanism per se:

+       if (!ExecPlanStillValid(estate))
+               return aggstate;

I think at least those 3 kinds of changes deserve to be in separate
patches with separate commit messages explaining the rationale behind
each e.g. "Remove unnecessary cleanup calls in ExecEnd* functions.
These calls are no longer required, because <reasons>. Removing them
saves a few CPU cycles and simplifies planned refactoring, so do
that."

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
Thanks for taking a look.

On Mon, Aug 28, 2023 at 10:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 11, 2023 at 9:50 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > After removing the unnecessary cleanup code from most node types’ ExecEnd* functions, one thing I’m tempted to do is
> > remove the functions that do nothing else but recurse to close the outerPlan and innerPlan child nodes.  We could instead
> > have ExecEndNode() itself recurse into the outerPlan and innerPlan child nodes at the top, which preserves the
> > close-child-before-self behavior for Gather* nodes, and call node-type-specific cleanup functions only for those nodes
> > that do have any local cleanup to do.  Perhaps we could even use planstate_tree_walker() called at the top instead of the
> > usual bottom, so that nodes with a list of child subplans, like Append, also don’t need to have their own ExecEnd* functions.
>
> I think 0001 needs to be split up. Like, this is code cleanup:
>
> -       /*
> -        * Free the exprcontext
> -        */
> -       ExecFreeExprContext(&node->ss.ps);
>
> This is providing for NULL pointers where we don't currently:
>
> -       list_free_deep(aggstate->hash_batches);
> +       if (aggstate->hash_batches)
> +               list_free_deep(aggstate->hash_batches);
>
> And this is the early return mechanism per se:
>
> +       if (!ExecPlanStillValid(estate))
> +               return aggstate;
>
> I think at least those 3 kinds of changes deserve to be in separate
> patches with separate commit messages explaining the rationale behind
> each e.g. "Remove unnecessary cleanup calls in ExecEnd* functions.
> These calls are no longer required, because <reasons>. Removing them
> saves a few CPU cycles and simplifies planned refactoring, so do
> that."

Breaking up the patch as you describe makes sense, so I've done that:

Attached 0001 removes unnecessary cleanup calls from ExecEnd*() routines.

0002 adds NULLness checks in ExecEnd*() routines on some pointers that
may not be initialized by the corresponding ExecInit*() routines in
the case where it returns early.

0003 adds the early return mechanism based on checking CachedPlan
invalidation, though no CachedPlan is actually passed to the executor
yet, so no functional changes here yet.

Other patches are rebased over these.  One significant change is in
0004 which does the refactoring to make the callers of ExecutorStart()
aware that it may now return with a partially initialized planstate
tree that should not be executed.  I added a new flag
EState.es_canceled to denote that state of the execution to complement
the existing es_finished.  I also needed to add
AfterTriggerCancelQuery() to ensure that we don't attempt to fire a
canceled query's triggers.  Most of these changes are needed only to
appease the various Asserts in these parts of the code and I thought
they are warranted given the introduction of a new state of query
execution.
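
With that, a caller of ExecutorStart() would do something along these
lines (only a rough sketch; the actual adjustments are in 0004):

    if (!ExecutorStart(queryDesc, 0))
    {
        /*
         * The CachedPlan became invalid while initializing the plan tree,
         * so the partially initialized planstate tree must not be executed;
         * the caller cleans up and retries with a freshly created plan.
         */
    }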


--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Tue, Sep 5, 2023 at 3:13 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Attached 0001 removes unnecessary cleanup calls from ExecEnd*() routines.

It also adds a few random Assert()s to verify that unrelated pointers
are not NULL. I suggest that it shouldn't do that.

The commit message doesn't mention the removal of the calls to
ExecDropSingleTupleTableSlot. It's not clear to me why that's OK and I
think it would be nice to mention it in the commit message, assuming
that it is in fact OK.

I suggest changing the subject line of the commit to something like
"Remove obsolete executor cleanup code."

> 0002 adds NULLness checks in ExecEnd*() routines on some pointers that
> may not be initialized by the corresponding ExecInit*() routines in
> the case where it returns early.

I think you should only add these where it's needed. For example, I
think list_free_deep(NIL) is fine.

The changes to ExecEndForeignScan look like they include stuff that
belongs in 0001.

Personally, I prefer explicit NULL-tests i.e. if (x != NULL) to
implicit ones like if (x), but opinions vary.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Sep 5, 2023 at 11:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Sep 5, 2023 at 3:13 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Attached 0001 removes unnecessary cleanup calls from ExecEnd*() routines.
>
> It also adds a few random Assert()s to verify that unrelated pointers
> are not NULL. I suggest that it shouldn't do that.

OK, removed.

> The commit message doesn't mention the removal of the calls to
> ExecDropSingleTupleTableSlot. It's not clear to me why that's OK and I
> think it would be nice to mention it in the commit message, assuming
> that it is in fact OK.

That is not OK, so I dropped their removal. I think I confused them
with slots in other functions initialized with
ExecInitExtraTupleSlot() that *are* put into the estate.

> I suggest changing the subject line of the commit to something like
> "Remove obsolete executor cleanup code."

Sure.

> > 0002 adds NULLness checks in ExecEnd*() routines on some pointers that
> > may not be initialized by the corresponding ExecInit*() routines in
> > the case where it returns early.
>
> I think you should only add these where it's needed. For example, I
> think list_free_deep(NIL) is fine.

OK, done.

> The changes to ExecEndForeignScan look like they include stuff that
> belongs in 0001.

Oops, yes.  Moved to 0001.

> Personally, I prefer explicit NULL-tests i.e. if (x != NULL) to
> implicit ones like if (x), but opinions vary.

I agree, so changed all the new tests to use (x != NULL) form.
Typically, I try to stick with whatever style is used in the nearby
code, though I can see both styles being used in the ExecEnd*()
routines.  I opted to use the style that we both happen to prefer.

Attached updated patches.  Thanks for the review.


--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Wed, Sep 6, 2023 at 5:12 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Attached updated patches.  Thanks for the review.

I think 0001 looks ready to commit. I'm not sure that the commit
message needs to mention future patches here, since this code cleanup
seems like a good idea regardless, but if you feel otherwise, fair
enough.

On 0002, some questions:

- In ExecEndLockRows, is the call to EvalPlanQualEnd a concern? i.e.
Does that function need any adjustment?
- In ExecEndMemoize, should there be a null-test around
MemoryContextDelete(node->tableContext) as we have in
ExecEndRecursiveUnion, ExecEndSetOp, etc.?

I wonder how we feel about setting pointers to NULL after freeing the
associated data structures. The existing code isn't consistent about
doing that, and making it do so would be a fairly large change that
would bloat this patch quite a bit. On the other hand, I think it's a
good practice as a general matter, and we do do it in some ExecEnd
functions.

On 0003, I have some doubt about whether we really have all the right
design decisions in detail here:

- Why have this weird rule where sometimes we return NULL and other
times the planstate? Is there any point to such a coding rule? Why not
just always return the planstate?

- Is there any point to all of these early exit cases? For example, in
ExecInitBitmapAnd, why exit early if initialization fails? Why not
just plunge ahead and if initialization failed the caller will notice
that and when we ExecEndNode some of the child node pointers will be
NULL but who cares? The obvious disadvantage of this approach is that
we're doing a bunch of unnecessary initialization, but we're also
speeding up the common case where we don't need to abort by avoiding a
branch that will rarely be taken. I'm not quite sure what the right
thing to do is here.

- The cases where we call ExecGetRangeTableRelation or
ExecOpenScanRelation are a bit subtler ... maybe initialization that
we're going to do later is going to barf if the tuple descriptor of
the relation isn't what we thought it was going to be. In that case it
becomes important to exit early. But if that's not actually a problem,
then we could apply the same principle here also -- don't pollute the
code with early-exit cases, just let it do its thing and sort it out
later. Do you know what the actual problems would be here if we didn't
exit early in these cases?

- Depending on the answers to the above points, one thing we could
think of doing is put an early exit case into ExecInitNode itself: if
(unlikely(!ExecPlanStillValid(whatever)) return NULL. Maybe Andres or
someone is going to argue that that checks too often and is thus too
expensive, but it would be a lot more maintainable than having similar
checks strewn throughout the ExecInit* functions. Perhaps it deserves
some thought/benchmarking. More generally, if there's anything we can
do to centralize these checks in fewer places, I think that would be
worth considering. The patch isn't terribly large as it stands, so I
don't necessarily think that this is a critical issue, but I'm just
wondering if we can do better. I'm not even sure that it would be too
expensive to just initialize the whole plan always, and then just do
one test at the end. That's not OK if the changed tuple descriptor (or
something else) is going to crash or error out in a funny way or
something before initialization is completed, but if it's just going
to result in burning a few CPU cycles in a corner case, I don't know
if we should really care.

- The "At this point" comments don't give any rationale for why we
shouldn't have received any such invalidation messages. That  makes
them fairly useless; the Assert by itself clarifies that you think
that case shouldn't happen. The comment's job is to justify that
claim.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Wed, Sep 6, 2023 at 11:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Sep 6, 2023 at 5:12 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > Attached updated patches.  Thanks for the review.
>
> I think 0001 looks ready to commit. I'm not sure that the commit
> message needs to mention future patches here, since this code cleanup
> seems like a good idea regardless, but if you feel otherwise, fair
> enough.

OK, I will remove the mention of future patches.

> On 0002, some questions:
>
> - In ExecEndLockRows, is the call to EvalPlanQualEnd a concern? i.e.
> Does that function need any adjustment?

I think it does with the patch as it stands.  It needs to have an
early exit at the top if parentestate is NULL, which it would be if
EvalPlanQualInit() wasn't called from an ExecInit*() function.

Though, as I answer below your question as to whether there is
actually any need to interrupt all of the ExecInit*() routines,
nothing needs to change in ExecEndLockRows().

> - In ExecEndMemoize, should there be a null-test around
> MemoryContextDelete(node->tableContext) as we have in
> ExecEndRecursiveUnion, ExecEndSetOp, etc.?

Oops, you're right.  Added.

> I wonder how we feel about setting pointers to NULL after freeing the
> associated data structures. The existing code isn't consistent about
> doing that, and making it do so would be a fairly large change that
> would bloat this patch quite a bit. On the other hand, I think it's a
> good practice as a general matter, and we do do it in some ExecEnd
> functions.

I agree that it might be worthwhile to take the opportunity and make
the code more consistent in this regard.  So, I've included those
changes too in 0002.

> On 0003, I have some doubt about whether we really have all the right
> design decisions in detail here:
>
> - Why have this weird rule where sometimes we return NULL and other
> times the planstate? Is there any point to such a coding rule? Why not
> just always return the planstate?
>
> - Is there any point to all of these early exit cases? For example, in
> ExecInitBitmapAnd, why exit early if initialization fails? Why not
> just plunge ahead and if initialization failed the caller will notice
> that and when we ExecEndNode some of the child node pointers will be
> NULL but who cares? The obvious disadvantage of this approach is that
> we're doing a bunch of unnecessary initialization, but we're also
> speeding up the common case where we don't need to abort by avoiding a
> branch that will rarely be taken. I'm not quite sure what the right
> thing to do is here.
>
> - The cases where we call ExecGetRangeTableRelation or
> ExecOpenScanRelation are a bit subtler ... maybe initialization that
> we're going to do later is going to barf if the tuple descriptor of
> the relation isn't what we thought it was going to be. In that case it
> becomes important to exit early. But if that's not actually a problem,
> then we could apply the same principle here also -- don't pollute the
> code with early-exit cases, just let it do its thing and sort it out
> later. Do you know what the actual problems would be here if we didn't
> exit early in these cases?
>
> - Depending on the answers to the above points, one thing we could
> think of doing is put an early exit case into ExecInitNode itself: if
> (unlikely(!ExecPlanStillValid(whatever)) return NULL. Maybe Andres or
> someone is going to argue that that checks too often and is thus too
> expensive, but it would be a lot more maintainable than having similar
> checks strewn throughout the ExecInit* functions. Perhaps it deserves
> some thought/benchmarking. More generally, if there's anything we can
> do to centralize these checks in fewer places, I think that would be
> worth considering. The patch isn't terribly large as it stands, so I
> don't necessarily think that this is a critical issue, but I'm just
> wondering if we can do better. I'm not even sure that it would be too
> expensive to just initialize the whole plan always, and then just do
> one test at the end. That's not OK if the changed tuple descriptor (or
> something else) is going to crash or error out in a funny way or
> something before initialization is completed, but if it's just going
> to result in burning a few CPU cycles in a corner case, I don't know
> if we should really care.

I thought about this some and figured that adding the
is-CachedPlan-still-valid tests in the following places should suffice
after all:

1. In InitPlan() right after the top-level ExecInitNode() calls
2. In ExecInit*() functions of Scan nodes, right after
ExecOpenScanRelation() calls

CachedPlans can only become invalid because of concurrent changes to
the inheritance child tables referenced in the plan.  Only the
following schema modifications of child tables are possible to be
performed concurrently:

* Addition of a column (allowed only if traditional inheritance child)
* Addition of an index
* Addition of a non-index constraint
* Dropping of a child table (allowed only if traditional inheritance child)
* Dropping of an index referenced in the plan

The first three are not destructive enough to cause crashes or weird
errors during ExecInit*(), though the last two can be, hence the 2nd
set of tests mentioned above, placed right after ExecOpenScanRelation().
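
For example, roughly, in a Scan node's ExecInit*() routine:

    scanstate->ss.ss_currentRelation =
        ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
    if (!ExecPlanStillValid(estate))
        return scanstate;   /* opening the relation may have invalidated the plan */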

> - The "At this point" comments don't give any rationale for why we
> shouldn't have received any such invalidation messages. That  makes
> them fairly useless; the Assert by itself clarifies that you think
> that case shouldn't happen. The comment's job is to justify that
> claim.

I've rewritten the comments.

I'll post the updated set of patches shortly.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Mon, Sep 25, 2023 at 9:57 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Sep 6, 2023 at 11:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > - Is there any point to all of these early exit cases? For example, in
> > ExecInitBitmapAnd, why exit early if initialization fails? Why not
> > just plunge ahead and if initialization failed the caller will notice
> > that and when we ExecEndNode some of the child node pointers will be
> > NULL but who cares? The obvious disadvantage of this approach is that
> > we're doing a bunch of unnecessary initialization, but we're also
> > speeding up the common case where we don't need to abort by avoiding a
> > branch that will rarely be taken. I'm not quite sure what the right
> > thing to do is here.
> I thought about this some and figured that adding the
> is-CachedPlan-still-valid tests in the following places should suffice
> after all:
>
> 1. In InitPlan() right after the top-level ExecInitNode() calls
> 2. In ExecInit*() functions of Scan nodes, right after
> ExecOpenScanRelation() calls

After sleeping on this, I think we do need the checks after all the
ExecInitNode() calls too, because we have many instances of the code
like the following one:

    outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
    tupDesc = ExecGetResultType(outerPlanState(gatherstate));
    <some code that dereferences tupDesc>

If outerNode is a SeqScan and ExecInitSeqScan() returned early because
ExecOpenScanRelation() detected that plan was invalidated, then
tupDesc would be NULL in this case, causing the code to crash.

Now one might say that perhaps we should only add the
is-CachedPlan-valid test in the instances where there is an actual
risk of such misbehavior, but that could lead to confusion, now or
later.  It seems better to add them after every ExecInitNode() call
while we're inventing the notion, because doing so relieves the
authors of future enhancements of the ExecInit*() routines from
worrying about any of this.
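
Concretely, the Gather example above would become, roughly:

    outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
    if (!ExecPlanStillValid(estate))
        return gatherstate;
    tupDesc = ExecGetResultType(outerPlanState(gatherstate));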

Attached 0003 should show how that turned out.

Updated 0002 as mentioned in the previous reply -- setting pointers to
NULL after freeing them more consistently across various ExecEnd*()
routines and using the `if (pointer != NULL)` style over the `if
(pointer)` more consistently.

Updated 0001's commit message to remove the mention of its relation to
any future commits.  I intend to push it tomorrow.

Patches 0004 onwards contain changes too, mainly in terms of moving
the code around from one patch to another, but I'll omit the details
of the specific change for now.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Sep 26, 2023 at 10:06 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Mon, Sep 25, 2023 at 9:57 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Wed, Sep 6, 2023 at 11:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > - Is there any point to all of these early exit cases? For example, in
> > > ExecInitBitmapAnd, why exit early if initialization fails? Why not
> > > just plunge ahead and if initialization failed the caller will notice
> > > that and when we ExecEndNode some of the child node pointers will be
> > > NULL but who cares? The obvious disadvantage of this approach is that
> > > we're doing a bunch of unnecessary initialization, but we're also
> > > speeding up the common case where we don't need to abort by avoiding a
> > > branch that will rarely be taken. I'm not quite sure what the right
> > > thing to do is here.
> > I thought about this some and figured that adding the
> > is-CachedPlan-still-valid tests in the following places should suffice
> > after all:
> >
> > 1. In InitPlan() right after the top-level ExecInitNode() calls
> > 2. In ExecInit*() functions of Scan nodes, right after
> > ExecOpenScanRelation() calls
>
> After sleeping on this, I think we do need the checks after all the
> ExecInitNode() calls too, because we have many instances of the code
> like the following one:
>
>     outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
>     tupDesc = ExecGetResultType(outerPlanState(gatherstate));
> >     <some code that dereferences tupDesc>
>
> If outerNode is a SeqScan and ExecInitSeqScan() returned early because
> ExecOpenScanRelation() detected that plan was invalidated, then
> tupDesc would be NULL in this case, causing the code to crash.
>
> Now one might say that perhaps we should only add the
> is-CachedPlan-valid test in the instances where there is an actual
> risk of such misbehavior, but that could lead to confusion, now or
> later.  It seems better to add them after every ExecInitNode() call
> while we're inventing the notion, because doing so relieves the
> authors of future enhancements of the ExecInit*() routines from
> worrying about any of this.
>
> Attached 0003 should show how that turned out.
>
> Updated 0002 as mentioned in the previous reply -- setting pointers to
> NULL after freeing them more consistently across various ExecEnd*()
> routines and using the `if (pointer != NULL)` style over the `if
> (pointer)` more consistently.
>
> Updated 0001's commit message to remove the mention of its relation to
> any future commits.  I intend to push it tomorrow.

Pushed that one.  Here are the rebased patches.

0001 seems ready to me, but I'll wait a couple more days for others to
weigh in.  Just to highlight a kind of change that others may have
differing opinions on, consider this hunk from the patch:

-   MemoryContextDelete(node->aggcontext);
+   if (node->aggcontext != NULL)
+   {
+       MemoryContextDelete(node->aggcontext);
+       node->aggcontext = NULL;
+   }
...
+   ExecEndNode(outerPlanState(node));
+   outerPlanState(node) = NULL;

So the patch wants to enhance the consistency of setting the pointer
to NULL after freeing part.  Robert mentioned his preference for doing
it in the patch, which I agree with.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Sep 28, 2023 at 5:26 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Sep 26, 2023 at 10:06 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > After sleeping on this, I think we do need the checks after all the
> > ExecInitNode() calls too, because we have many instances of the code
> > like the following one:
> >
> >     outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
> >     tupDesc = ExecGetResultType(outerPlanState(gatherstate));
> >     <some code that dereferences tupDesc>
> >
> > If outerNode is a SeqScan and ExecInitSeqScan() returned early because
> > ExecOpenScanRelation() detected that plan was invalidated, then
> > tupDesc would be NULL in this case, causing the code to crash.
> >
> > Now one might say that perhaps we should only add the
> > is-CachedPlan-valid test in the instances where there is an actual
> > risk of such misbehavior, but that could lead to confusion, now or
> > later.  It seems better to add them after every ExecInitNode() call
> > while we're inventing the notion, because doing so relieves the
> > authors of future enhancements of the ExecInit*() routines from
> > worrying about any of this.
> >
> > Attached 0003 should show how that turned out.
> >
> > Updated 0002 as mentioned in the previous reply -- setting pointers to
> > NULL after freeing them more consistently across various ExecEnd*()
> > routines and using the `if (pointer != NULL)` style over the `if
> > (pointer)` more consistently.
> >
> > Updated 0001's commit message to remove the mention of its relation to
> > any future commits.  I intend to push it tomorrow.
>
> Pushed that one.  Here are the rebased patches.
>
> 0001 seems ready to me, but I'll wait a couple more days for others to
> weigh in.  Just to highlight a kind of change that others may have
> differing opinions on, consider this hunk from the patch:
>
> -   MemoryContextDelete(node->aggcontext);
> +   if (node->aggcontext != NULL)
> +   {
> +       MemoryContextDelete(node->aggcontext);
> +       node->aggcontext = NULL;
> +   }
> ...
> +   ExecEndNode(outerPlanState(node));
> +   outerPlanState(node) = NULL;
>
> So the patch wants to enhance the consistency of setting the pointer
> to NULL after freeing part.  Robert mentioned his preference for doing
> it in the patch, which I agree with.

Rebased.

I haven't been able to reproduce and debug a crash reported by cfbot
that I see every now and then:

https://cirrus-ci.com/task/5673432591892480?logs=cores#L0

[22:46:12.328] Program terminated with signal SIGSEGV, Segmentation fault.
[22:46:12.328] Address not mapped to object.
[22:46:12.838] #0 afterTriggerInvokeEvents
(events=events@entry=0x836db0460, firing_id=1,
estate=estate@entry=0x842eec100, delete_ok=<optimized out>) at
../src/backend/commands/trigger.c:4656
[22:46:12.838] #1 0x00000000006c67a8 in AfterTriggerEndQuery
(estate=estate@entry=0x842eec100) at
../src/backend/commands/trigger.c:5085
[22:46:12.838] #2 0x000000000065bfba in CopyFrom (cstate=0x836df9038)
at ../src/backend/commands/copyfrom.c:1293
...

While a patch in this series does change
src/backend/commands/trigger.c, I'm not yet sure about its relation
with the backtrace shown there.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: generic plans and "initial" pruning

From
Robert Haas
Date:
Reviewing 0001:

Perhaps ExecEndCteScan needs an adjustment. What if node->leader was never set?

Other than that, I think this is in good shape. Maybe there are other
things we'd want to adjust here, or maybe there aren't, but there
doesn't seem to be any good reason to bundle more changes into the
same patch.

Reviewing 0002 and beyond:

I think it's good that you have tried to divide up a big change into
little pieces, but I'm finding the result difficult to understand. It
doesn't really seem like each patch stands on its own. I keep flipping
between patches to try to understand why other patches are doing
things, which kind of defeats the purpose of splitting stuff up. For
example, 0002 adds a NodeTag field to QueryDesc, but it doesn't even
seem to initialize that field, let alone use it for anything. It adds
a CachedPlan pointer to QueryDesc too, and adapts CreateQueryDesc to
allow one as an argument, but none of the callers actually pass
anything. I suspect that that the first change (adding a NodeTag)
field is a bug, and that the second one is intentional, but it's hard
to tell without flipping through all of the other patches to see how
they build on what 0002 does. And even when something isn't a bug,
it's also hard to tell whether it's the right design, again because
you can't consider each patch in isolation. Ideally, splitting a patch
set should bring related changes together in a single patch and push
unrelated changes apart into different patches, but I don't really see
this particular split having that effect.

There is a chicken and egg problem here, to be fair. If we add code
that can make plan initialization fail without teaching the planner to
cope with failures, then we have broken the server, and if we do the
reverse, then we have a bunch of dead code that we can't test. Neither
is very satisfactory. But I still hope there's some better division
possible than what you have here currently. For instance, I wonder if
it would be possible to add all the stuff to cope with plan
initialization failing and then have a test patch that makes
initialization randomly fail with some probability (or maybe you can
even cause failures at specific points). Then you could test that
infrastructure by running the regression tests in a loop with various
values of the relevant setting.

Another overall comment that I have is that it doesn't feel like
there's enough high-level explanation of the design. I don't know how
much of that should go in comments vs. commit messages vs. a README
that accompanies the patch set vs. whatever else, and I strongly
suspect that some of the stuff that seems confusing now is actually
stuff that at one point I understood and have just forgotten about.
But rediscovering it shouldn't be quite so hard. For example, consider
the question "why are we storing the CachedPlan in the QueryDesc?" I
eventually figured out that it's so that ExecPlanStillValid can call
CachedPlanStillValid which can then consult the cached plan's is_valid
flag. But is that the only access to the CachedPlan that we ever
expect to occur via the QueryDesc? If not, what else is allowable? If
so, why not just store a Boolean in the QueryDesc and arrange for the
plancache to be able to flip it when invalidating? I'm not saying
that's a better design -- I'm saying that it looks hard to understand
your thought process from the patch set. And also, you know, assuming
the current design is correct, could there be some way of dividing up
the patch set so that this one change, where we add the CachedPlan to
the QueryDesc, isn't so spread out across the whole series?

Some more detailed review comments below. This isn't really a full
review because I don't understand the patches well enough for that,
but it's some stuff I noticed.

In 0002:

+     * result-rel info, etc.  Also, we don't pass the parent't copy of the

Typo.

+        /*
+         * All the necessary locks must already have been taken when
+         * initializing the parent's copy of subplanstate, so the CachedPlan,
+         * if any, should not have become invalid during ExecInitNode().
+         */
+        Assert(ExecPlanStillValid(rcestate));

This -- and the other similar instance -- feel very uncomfortable.
There's a lot of action at a distance here. If this assertion ever
failed, how would anyone ever figure out what went wrong? You wouldn't
for example know which object got invalidated, presumably
corresponding to a lock that you failed to take. Unless the problem
were easily reproducible in a test environment, trying to guess what
happened might be pretty awful; imagine seeing this assertion failure
in a customer log file and trying to back-track to the find the
underlying bug. A further problem is that what would actually happen
is you *wouldn't* see this in the customer log file, because
assertions wouldn't be enabled, so you'd just see queries occasionally
returning wrong answers, I guess? Or crashing in some other random
part of the code? Which seems even worse. At a minimum I think this
should be upgraded to a test-and-elog, and maybe there's some value in
trying to think of what should get printed by that elog to facilitate
proper debugging, if it happens.

In 0003:

+                *
+                * OK to ignore the return value; plan can't become invalid,
+                * because there's no CachedPlan.
                 */
-               ExecutorStart(cstate->queryDesc, 0);
+               (void) ExecutorStart(cstate->queryDesc, 0);

This also feels awkward, for similar reasons. Sure, it shouldn't
return false, but also, if it did, you'd just blindly continue. Maybe
there should be test-and-elog here too. Or maybe this is an indication
that we need less action at a distance. Like, if ExecutorStart took
the CachedPlan as an argument instead of feeding it through the
QueryDesc, then you could document that ExecutorStart returns true if
that value is passed as NULL and true or false otherwise. Here,
whether ExecutorStart can return true or false depends on the contents
of the queryDesc ... which, granted, in this case is just built a line
or two before anyway, but if you just passed it to ExecutorStart then
you wouldn't need to feed it through the QueryDesc, it seems to me.
Even better, maybe there should be ExecutorStart() that continues
returning void and ExecutorStartExtended() that takes a cached plan as
an additional argument and returns a bool.
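
That is, roughly:

    extern void ExecutorStart(QueryDesc *queryDesc, int eflags);
    extern bool ExecutorStartExtended(QueryDesc *queryDesc, int eflags,
                                      CachedPlan *cplan);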

        /*
-        * Check that ExecutorFinish was called, unless in
EXPLAIN-only mode. This
-        * Assert is needed because ExecutorFinish is new as of 9.1, and callers
-        * might forget to call it.
+        * Check that ExecutorFinish was called, unless in
EXPLAIN-only mode or if
+        * execution was canceled. This Assert is needed because
ExecutorFinish is
+        * new as of 9.1, and callers might forget to call it.
         */

Maybe we could drop the second sentence at this point.

In 0005:

+                        * XXX Maybe we should we skip calling
ExecCheckPermissions from
+                        * InitPlan in a parallel worker.

Why? If the thinking is to save overhead, then perhaps try to assess
the overhead. If the thinking is that we don't want it to fail
spuriously, then we have to weight that against the (security) risk of
succeeding spuriously.

+ * Returns true if current transaction holds a lock on the given relation of
+ * mode 'lockmode'.  If 'orstronger' is true, a stronger lockmode is also OK.
+ * ("Stronger" is defined as "numerically higher", which is a bit
+ * semantically dubious but is OK for the purposes we use this for.)

I don't particularly enjoy seeing this comment cut and pasted into
some new place. Especially the tongue-in-cheek parenthetical part.
Better to refer to the original comment or something instead of
cut-and-pasting. Also, why is it appropriate to pass orstronger = true
here? Don't we expect the *exact* lock mode that we have planned to be
held, and isn't it a sure sign of a bug if it isn't? Maybe orstronger
should just be ripped out here (and the comment could then go away
too).

In 0006:

+       /*
+        * RTIs of all partitioned tables whose children are scanned by
+        * appendplans. The list contains a bitmapset for every partition tree
+        * covered by this Append.
+        */

The first sentence of this comment makes this sound like a list of
integers, the RTIs of all partitioned tables that are scanned. The
second sentence makes it sound like a list of bitmapsets, but what
does it mean to talk about each partition tree covered by this Append?

This is far from a complete review but I'm running out of steam for
today. I hope that it's at least somewhat useful.

...Robert



Re: generic plans and "initial" pruning

From
vignesh C
Date:
On Mon, 20 Nov 2023 at 10:00, Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Thu, Sep 28, 2023 at 5:26 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Tue, Sep 26, 2023 at 10:06 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > > After sleeping on this, I think we do need the checks after all the
> > > ExecInitNode() calls too, because we have many instances of the code
> > > like the following one:
> > >
> > >     outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
> > >     tupDesc = ExecGetResultType(outerPlanState(gatherstate));
> > >     <some code that dereferences tupDesc>
> > >
> > > If outerNode is a SeqScan and ExecInitSeqScan() returned early because
> > > ExecOpenScanRelation() detected that plan was invalidated, then
> > > tupDesc would be NULL in this case, causing the code to crash.
> > >
> > > Now one might say that perhaps we should only add the
> > > is-CachedPlan-valid test in the instances where there is an actual
> > > risk of such misbehavior, but that could lead to confusion, now or
> > > later.  It seems better to add them after every ExecInitNode() call
> > > while we're inventing the notion, because doing so relieves the
> > > authors of future enhancements of the ExecInit*() routines from
> > > worrying about any of this.
> > >
> > > Attached 0003 should show how that turned out.
> > >
> > > Updated 0002 as mentioned in the previous reply -- setting pointers to
> > > NULL after freeing them more consistently across various ExecEnd*()
> > > routines and using the `if (pointer != NULL)` style over the `if
> > > (pointer)` more consistently.
> > >
> > > Updated 0001's commit message to remove the mention of its relation to
> > > any future commits.  I intend to push it tomorrow.
> >
> > Pushed that one.  Here are the rebased patches.
> >
> > 0001 seems ready to me, but I'll wait a couple more days for others to
> > weigh in.  Just to highlight a kind of change that others may have
> > differing opinions on, consider this hunk from the patch:
> >
> > -   MemoryContextDelete(node->aggcontext);
> > +   if (node->aggcontext != NULL)
> > +   {
> > +       MemoryContextDelete(node->aggcontext);
> > +       node->aggcontext = NULL;
> > +   }
> > ...
> > +   ExecEndNode(outerPlanState(node));
> > +   outerPlanState(node) = NULL;
> >
> > So the patch wants to enhance the consistency of setting the pointer
> > to NULL after freeing part.  Robert mentioned his preference for doing
> > it in the patch, which I agree with.
>
> Rebased.

There is a leak reported at [1], details for the same is available at [2]:
diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/select_views.out
/tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/select_views.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/select_views.out
2023-12-19 23:00:04.677385000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/select_views.out
2023-12-19 23:06:26.870259000 +0000
@@ -1288,6 +1288,7 @@
       (102, '2011-10-12', 120),
       (102, '2011-10-28', 200),
       (103, '2011-10-15', 480);
+WARNING:  resource was not closed: relation "customer_pkey"
 CREATE VIEW my_property_normal AS
        SELECT * FROM customer WHERE name = current_user;
 CREATE VIEW my_property_secure WITH (security_barrier) A

[1] - https://cirrus-ci.com/task/6494009196019712
[2] -
https://api.cirrus-ci.com/v1/artifact/task/6494009196019712/testrun/build/testrun/regress-running/regress/regression.diffs

Regards,
Vignesh



Re: generic plans and "initial" pruning

From
"Andrey M. Borodin"
Date:

> On 6 Dec 2023, at 23:52, Robert Haas <robertmhaas@gmail.com> wrote:
>
>  I hope that it's at least somewhat useful.
>


> On 5 Jan 2024, at 15:46, vignesh C <vignesh21@gmail.com> wrote:
>
> There is a leak reported

Hi Amit,

this is a kind reminder that some feedback on your patch[0] is waiting for your reply.
Thank you for your work!

Best regards, Andrey Borodin.


[0] https://commitfest.postgresql.org/47/3478/


Re: generic plans and "initial" pruning

From
Amit Langote
Date:
Hi Andrey,

On Sun, Mar 31, 2024 at 2:03 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> > On 6 Dec 2023, at 23:52, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> >  I hope that it's at least somewhat useful.
>
> > On 5 Jan 2024, at 15:46, vignesh C <vignesh21@gmail.com> wrote:
> >
> > There is a leak reported
>
> Hi Amit,
>
> this is a kind reminder that some feedback on your patch[0] is waiting for your reply.
> Thank you for your work!

Thanks for moving this to the next CF.

My apologies (especially to Robert) for not replying on this thread
for a long time.

I plan to start working on this soon.

--
Thanks, Amit Langote



Re: generic plans and "initial" pruning

From
David Rowley
Date:
On Fri, 20 Jan 2023 at 08:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I spent some time re-reading this whole thread, and the more I read
> the less happy I got.  We are adding a lot of complexity and introducing
> coding hazards that will surely bite somebody someday.  And after awhile
> I had what felt like an epiphany: the whole problem arises because the
> system is wrongly factored.  We should get rid of AcquireExecutorLocks
> altogether, allowing the plancache to hand back a generic plan that
> it's not certain of the validity of, and instead integrate the
> responsibility for acquiring locks into executor startup.  It'd have
> to be optional there, since we don't need new locks in the case of
> executing a just-planned plan; but we can easily add another eflags
> bit (EXEC_FLAG_GET_LOCKS or so).  Then there has to be a convention
> whereby the ExecInitNode traversal can return an indicator that
> "we failed because the plan is stale, please make a new plan".

I also reread the entire thread up to this point yesterday. I've also
been thinking about this recently as Amit has mentioned it to me a few
times over the past few months.

With the caveat of not yet having looked at the latest patch, my
thoughts are that having the executor startup responsible for taking
locks is a bad idea and I don't think we should go down this path. My
reasons are:

1. No ability to control the order that the locks are obtained. The
order in which the locks are taken will be at the mercy of the plan
the planner chooses.
2. It introduces lots of complexity regarding how to cleanly clean up
after a failed executor startup which is likely to make exec startup
slower and the code more complex
3. It puts us even further down the path of actually needing an
executor startup phase.

For #1, the locks taken for SELECT queries are less likely to conflict
with other locks obtained by PostgreSQL, but at least at the moment if
someone is getting deadlocks with a DDL type operation, they can
change their query or DDL script so that locks are taken in the same
order.  If we allowed executor startup to do this then if someone
comes complaining that PG18 deadlocks when PG17 didn't we'd just have
to tell them to live with it.  There's a comment at the bottom of
find_inheritance_children_extended() just above the qsort() which
explains about the deadlocking issue.

I don't have much extra to say about #2.  As mentioned, I've not
looked at the patch. On paper, it sounds possible, but it also sounds
bug-prone and ugly.

For #3, I've been thinking about what improvements we can do to make
the executor more efficient. In [1], Andres talks about some very
interesting things. In particular, in his email items 3) and 5) are
relevant here. If we did move lots of executor startup code into the
planner, I think it would be possible to one day get rid of executor
startup and have the plan record how much memory is needed for the
non-readonly part of the executor state and tag each plan node with
the offset in bytes they should use for their portion of the executor
working state. This would be a single memory allocation for the entire
plan.  The exact details are not important here, but I feel like if we
load up executor startup with more responsibilities, it'll just make
doing something like this harder.  The init run-time pruning code that
I worked on likely already has done that, but I don't think it's
closed the door on it as it might just mean allocating more executor
state memory than we need to. Providing the plan node records the
offset into that memory, I think it could be made to work, just with
the inefficiency of having a (possibly) large unused hole in that
state memory.

As far as I understand it, your objection to the original proposal is
just on the grounds of concerns about introducing hazards that could
turn into bugs.  I think we could come up with some way to make the
prior method of doing pruning before executor startup work. I think
what Amit had before your objection was starting to turn into
something workable and we should switch back to working on that.

David

[1] https://www.postgresql.org/message-id/20180525033538.6ypfwcqcxce6zkjj%40alap3.anarazel.de



Re: generic plans and "initial" pruning

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> With the caveat of not yet having looked at the latest patch, my
> thoughts are that having the executor startup responsible for taking
> locks is a bad idea and I don't think we should go down this path.

OK, it's certainly still up for argument, but ...

> 1. No ability to control the order that the locks are obtained. The
> order in which the locks are taken will be at the mercy of the plan
> the planner chooses.

I do not think I buy this argument, because plancache.c doesn't
provide any "ability to control the order" today, and never has.
The order in which AcquireExecutorLocks re-gets relation locks is only
weakly related to the order in which the parser/planner got them
originally.  The order in which AcquirePlannerLocks re-gets the locks
is even less related to the original.  This doesn't cause any big
problems that I'm aware of, because these locks are fairly weak.

I think we do have a guarantee that for partitioned tables, parents
will be locked before children, and that's probably valuable.
But an executor-driven lock order could preserve that property too.

> 2. It introduces lots of complexity regarding how to cleanly clean up
> after a failed executor startup which is likely to make exec startup
> slower and the code more complex

Perhaps true, I'm not sure.  But the patch we'd been discussing
before this proposal was darn complex as well.

> 3. It puts us even further down the path of actually needing an
> executor startup phase.

Huh?  We have such a thing already.

> For #1, the locks taken for SELECT queries are less likely to conflict
> with other locks obtained by PostgreSQL, but at least at the moment if
> someone is getting deadlocks with a DDL type operation, they can
> change their query or DDL script so that locks are taken in the same
> order.  If we allowed executor startup to do this then if someone
> comes complaining that PG18 deadlocks when PG17 didn't we'd just have
> to tell them to live with it.  There's a comment at the bottom of
> find_inheritance_children_extended() just above the qsort() which
> explains about the deadlocking issue.

The reason it's important there is that function is (sometimes)
used for lock modes that *are* exclusive.

> For #3, I've been thinking about what improvements we can do to make
> the executor more efficient. In [1], Andres talks about some very
> interesting things. In particular, in his email items 3) and 5) are
> relevant here. If we did move lots of executor startup code into the
> planner, I think it would be possible to one day get rid of executor
> startup and have the plan record how much memory is needed for the
> non-readonly part of the executor state and tag each plan node with
> the offset in bytes they should use for their portion of the executor
> working state.

I'm fairly skeptical about that idea.  The entire reason we have an
issue here is that we want to do runtime partition pruning, which
by definition can't be done at plan time.  So I doubt it's going
to play nice with what we are trying to accomplish in this thread.

Moreover, while "replace a bunch of small pallocs with one big one"
would save some palloc effort, what are you going to do to ensure
that that memory has the right initial contents?  I think this idea is
likely to make the executor a great deal more notationally complex
without actually buying all that much.  Maybe Andres can make it work,
but I don't want to contort other parts of the system design on the
purely hypothetical basis that this might happen.

> I think what Amit had before your objection was starting to turn into
> something workable and we should switch back to working on that.

The reason I posted this idea was that I didn't think the previously
existing patch looked promising at all.

            regards, tom lane



Re: generic plans and "initial" pruning

From
David Rowley
Date:
On Sun, 19 May 2024 at 13:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > 1. No ability to control the order that the locks are obtained. The
> > order in which the locks are taken will be at the mercy of the plan
> > the planner chooses.
>
> I do not think I buy this argument, because plancache.c doesn't
> provide any "ability to control the order" today, and never has.
> The order in which AcquireExecutorLocks re-gets relation locks is only
> weakly related to the order in which the parser/planner got them
> originally.  The order in which AcquirePlannerLocks re-gets the locks
> is even less related to the original.  This doesn't cause any big
> problems that I'm aware of, because these locks are fairly weak.

It may not bite many people, it's just that if it does, I don't see
what we could do to help those people. At the moment we could tell
them to adjust their DDL script to obtain the locks in the same order
as their query.  With your idea that cannot be done as the order could
change when the planner switches the join order.

> I think we do have a guarantee that for partitioned tables, parents
> will be locked before children, and that's probably valuable.
> But an executor-driven lock order could preserve that property too.

I think you'd have to lock the parent before the child. That would
remain true and consistent anyway when taking locks during a
breadth-first plan traversal.

> > For #3, I've been thinking about what improvements we can do to make
> > the executor more efficient. In [1], Andres talks about some very
> > interesting things. In particular, in his email items 3) and 5) are
> > relevant here. If we did move lots of executor startup code into the
> > planner, I think it would be possible to one day get rid of executor
> > startup and have the plan record how much memory is needed for the
> > non-readonly part of the executor state and tag each plan node with
> > the offset in bytes they should use for their portion of the executor
> > working state.
>
> I'm fairly skeptical about that idea.  The entire reason we have an
> issue here is that we want to do runtime partition pruning, which
> by definition can't be done at plan time.  So I doubt it's going
> to play nice with what we are trying to accomplish in this thread.

I think we could have both, provided there was a way to still
traverse the executor state tree in EXPLAIN. We'd need a way to skip
portions of the plan that are not relevant or could be invalid for the
current execution, e.g., we can't show an Index Scan whose index has
been dropped.

> > I think what Amit had before your objection was starting to turn into
> > something workable and we should switch back to working on that.
>
> The reason I posted this idea was that I didn't think the previously
> existing patch looked promising at all.

Ok.  It would be good if you could expand on that so we could
determine if there's some fundamental reason it can't work or if
that's because you were blinded by your epiphany and didn't give that
any thought after thinking of the alternative idea.

I've gone to some effort to point out things that concern me about
your idea. It would be good if you could do the same for the
previous patch other than "it didn't look promising". It's pretty hard
for me to argue with that level of detail.

David



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Sun, May 19, 2024 at 9:39 AM David Rowley <dgrowleyml@gmail.com> wrote:
> For #1, the locks taken for SELECT queries are less likely to conflict
> with other locks obtained by PostgreSQL, but at least at the moment if
> someone is getting deadlocks with a DDL type operation, they can
> change their query or DDL script so that locks are taken in the same
> order.  If we allowed executor startup to do this then if someone
> comes complaining that PG18 deadlocks when PG17 didn't we'd just have
> to tell them to live with it.  There's a comment at the bottom of
> find_inheritance_children_extended() just above the qsort() which
> explains about the deadlocking issue.

Thought to chime in on this.

A deadlock may occur with the execution-time locking proposed in the
patch if the DDL script makes assumptions about how a cached plan's
execution determines the locking order for children of multiple parent
relations. Specifically, the deadlock can happen if the script tries
to lock the child relations directly, instead of locking them through
their respective parent relations.  The patch doesn't change the order
of locking of relations mentioned in the query, because that's defined
in AcquirePlannerLocks().
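
A sketch of the difference (relation names are made up):

-- Locking the children via their parents, in the same order the query
-- references the parents, keeps the script's lock order aligned with
-- AcquirePlannerLocks():
begin;
alter table p add column note text;   -- recurses to p's partitions
alter table q add column note text;   -- recurses to q's partitions
commit;

-- Locking the children directly makes assumptions about the order in
-- which a concurrent generic-plan execution will lock them, which is
-- what can deadlock:
begin;
alter table q_part1 add column note text;
alter table p_part1 add column note text;
commit;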

--
Thanks, Amit Langote



Re: generic plans and "initial" pruning

From
Alvaro Herrera
Date:
I had occasion to run the same benchmark you described in the initial
email in this thread.  To do so I applied patch series v49 on top of
07cb29737a4e, which is just a commit that happened to have the same date as
v49.

I then used a script like this (against a server having
plan_cache_mode=force_generic_plan)

for numparts in 0 1 2 4 8 16 32 48 64 80 81 96 127 128 160 200 256 257 288 300 384 512 1024 1536 2048;  do
    pgbench testdb -i --partitions=$numparts 2>/dev/null
    echo -ne "$numparts\t"
    pgbench -n testdb -S -T30 -Mprepared | grep "^tps" | sed -e 's/^tps = \([0-9.]*\) .*/\1/'
done

and did the same with the commit mentioned above (that is, unpatched).
I got this table as result

 partitions │   patched    │  07cb29737a  
────────────┼──────────────┼──────────────
          0 │ 65632.090431 │ 68967.712741
          1 │ 68096.641831 │ 65356.587223
          2 │ 59456.507575 │ 60884.679464
          4 │    62097.426 │ 59698.747104
          8 │ 58044.311175 │ 57817.104562
         16 │ 59741.926563 │ 52549.916262
         32 │ 59261.693449 │ 44815.317215
         48 │ 59047.125629 │ 38362.123652
         64 │ 59748.738797 │ 34051.158525
         80 │ 59276.839183 │ 32026.135076
         81 │ 62318.572932 │ 30418.122933
         96 │ 59678.857163 │ 28478.113651
        127 │ 58761.960028 │ 24272.303742
        128 │ 59934.268306 │ 24275.214593
        160 │ 56688.790899 │ 21119.043564
        200 │ 56323.188599 │ 18111.212849
        256 │  55915.22466 │ 14753.953709
        257 │ 57810.530461 │ 15093.497575
        288 │ 56874.780092 │ 13873.332162
        300 │ 57222.056549 │ 13463.768946
        384 │  54073.77295 │ 11183.558339
        512 │ 37503.766847 │   8114.32532
       1024 │ 42746.866448 │   4468.41359
       1536 │  39500.58411 │  3049.984599
       2048 │ 36988.519486 │  2269.362006

where already at 16 partitions we can see that things are going downhill
with the unpatched code.  (However, what happens when the table is not
partitioned looks a bit funny.)

I hope we can get this new executor code in 18.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Jun 20, 2024 at 2:09 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I hope we can get this new executor code in 18.

Thanks for doing the benchmark, Alvaro, and sorry for the late reply.

Yes, I'm hoping to get *some* version of this into v18.  I've been
thinking about how to move this forward, and I'm starting to think we
should go back to, or at least consider as an option, the old approach
of having the plancache do the initial runtime pruning, instead of
having the executor take locks, which is the design that the
latest patch set tries to implement.

Here are the challenges facing the implementation of the current design:

1. I went through many iterations of the changes to ExecInitNode() to
return a partially initialized PlanState tree when it detects that the
CachedPlan was invalidated after locking a child table and to
ExecEndNode() to account for the PlanState tree sometimes being
partially initialized, but it still seems fragile and bug-prone to me.
It might be because this approach is fundamentally hard to get right
or I haven't invested enough effort in becoming more confident in its
robustness.

2. The refactoring needed due to the ExecutorStart() API change,
especially the part pertaining to portals, does not seem airtight.  I'm especially
worried about moving the ExecutorStart() call for the
PORTAL_MULTI_QUERY case from where it is currently to PortalStart().
That requires additional bookkeeping in PortalData and I am not
totally sure that the snapshot handling changes after that move are
entirely correct.

3. The need to add *back* the fields to store the RT indexes of
relations that are not looked at by ExecInitNode() traversal such as
root partitioned tables and non-leaf partitions.

I'm worried about #2 the most.  One complaint about the previous
design was that the interface changes to capture and pass the result
of doing initial pruning in plancache.c to the executor did not look
great.  However, after having tried doing #2, the changes to pass the
pruning result into the executor and to reuse it in
ExecInit[Merge]Append() seem a bit simpler than the refactoring
and adjustments needed to handle failed ExecutorStart() calls, at
multiple code sites.

About #1, I tend to agree with David that adding complexity around
PlanState tree construction may not be a good idea, because we might
want to rethink Plan initialization code and data structures in the
not too distant future.  One idea I thought of is to take the
remaining locks (to wit, those on inheritance children if running a
cached plan) at the beginning of InitPlan(), that is before
ExecInitNode(), like we handle the permission checking, so that we
don't need to worry about ever returning a partially initialized
PlanState tree.  However, we're still left with the tall task to
implement #2 such that it doesn't break anything.

Another concern about the old design was the unnecessary overhead of
initializing bitmapset fields in PlannedStmt that are meant for the
locking algorithm in AcquireExecutorLocks().  Andres suggested an idea
offlist to either piggyback on cursorOptions argument of
pg_plan_queries() or adding a new boolean parameter to let the planner
know if the plan is one that might get cached and thus have
AcquireExecutorLocks() called on it.  Another idea David and I
discussed offlist is inventing an RTELockInfo (cf. RTEPermissionInfo),
creating one only for each RT entry that is un-prunable, and doing
away with PlannedStmt.rtable.  For partitioned tables, that entry will
point to the PartitionPruneInfo that will contain the RT indexes of
partitions (or maybe just OIDs) mapped from their subplan indexes that
are returned by the pruning code.  So AcquireExecutorLocks() will lock
all un-prunable relations by referring to their RTELockInfo entries
and for each entry that points to a PartitionPruneInfo with initial
pruning steps, will only lock the partitions that survive the pruning.

I am planning to polish that old patch set and post after playing with
those new ideas.

--
Thanks, Amit Langote



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Mon, Aug 12, 2024 at 8:54 AM Amit Langote <amitlangote09@gmail.com> wrote:
> 1. I went through many iterations of the changes to ExecInitNode() to
> return a partially initialized PlanState tree when it detects that the
> CachedPlan was invalidated after locking a child table and to
> ExecEndNode() to account for the PlanState tree sometimes being
> partially initialized, but it still seems fragile and bug-prone to me.
> It might be because this approach is fundamentally hard to get right
> or I haven't invested enough effort in becoming more confident in its
> robustness.

Can you give some examples of what's going wrong, or what you think
might go wrong?

I didn't think there was a huge problem here based on previous
discussion, but I could very well be missing some important challenge.

> 2. Refactoring needed due to the ExecutorStart() API change especially
> that pertaining to portals does not seem airtight.  I'm especially
> worried about moving the ExecutorStart() call for the
> PORTAL_MULTI_QUERY case from where it is currently to PortalStart().
> That requires additional bookkeeping in PortalData and I am not
> totally sure that the snapshot handling changes after that move are
> entirely correct.

Here again, it would help to see exactly what you had to do and what
consequences you think it might have. But it sounds like you're
talking about moving ExecutorStart() from PortalRun() to PortalStart()
and I agree that sounds like it might have user-visible behavioral
consequences that we don't want.

> 3. The need to add *back* the fields to store the RT indexes of
> relations that are not looked at by ExecInitNode() traversal such as
> root partitioned tables and non-leaf partitions.

I don't remember exactly why we removed those or what the benefit was,
so I'm not sure how big of a problem it is if we have to put them
back.

> About #1, I tend to agree with David that adding complexity around
> PlanState tree construction may not be a good idea, because we might
> want to rethink Plan initialization code and data structures in the
> not too distant future.

Like Tom, I don't really buy this. There might be a good reason not to
do this in ExecutorStart(), but the hypothetical possibility that we
might want to change something and that this patch might make it
harder is not it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Thu, Aug 15, 2024 at 8:57 AM Amit Langote <amitlangote09@gmail.com> wrote:
> TBH, it's more of a hunch that people who are not involved in this
> development might find the new reality, whereby the execution is not
> racefree until ExecutorRun(), hard to reason about.

I'm confused by what you mean here by "racefree". A race means
multiple sessions are doing stuff at the same time and the result
depends on who does what first, but the executor stuff is all
backend-private. Heavyweight locks are not backend-private, but those
would be taken in ExecutorStart(), not ExecutorRun(), IIUC.

> With the patch, CreateQueryDesc() and ExecutorStart() are moved to
> PortalStart() so that QueryDescs including the PlanState trees for all
> queries are built before any is run.  Why?  So that if ExecutorStart()
> fails for any query in the list, we can simply throw out the QueryDesc
> and the PlanState trees of the previous queries (NOT run them) and ask
> plancache for a new CachedPlan for the list of queries.  We don't have
> a way to ask plancache.c to replan only a given query in the list.

I agree that moving this from PortalRun() to PortalStart() seems like
a bad idea, especially in view of what you write below.

> * There's no longer CCI() between queries in PortalRunMulti() because
> the snapshots in each query's QueryDesc must have been adjusted to
> reflect the correct command counter.  I've checked but can't really be
> sure if the value in the snapshot is all anyone ever uses if they want
> to know the current value of the command counter.

I don't think anything stops somebody wanting to look at the current
value of the command counter. I also don't think you can remove the
CommandCounterIncrement() calls between successive queries, because
then they won't see the effects of earlier calls. So this sounds
broken to me.

Also keep in mind that one of the queries could call a function which
does something that bumps the command counter again. I'm not sure if
that creates its own hazzard separate from the lack of CCIs, or
whether it's just another part of that same issue. But you can't
assume that each query's snapshot should have a command counter value
one more than the previous query.
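
To make that concrete (a hypothetical sketch; the names are made up):

create table src (a int);
create table src_log (a int);
create rule src_log_rule as on insert to src
  do also insert into src_log select a from src where a = new.a;
insert into src values (1);
select * from src_log;     -- one row: the rule's query saw the row
                           -- inserted by the original query

The rewrite turns that INSERT into a list of two queries, and the
second one only sees the row added by the first because the command
counter is advanced between them.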

While this all seems bad for the partially-initialized-execution-tree
approach, I wonder if you don't have problems here with the other
design, too. Let's say you've got the multi-query case and there are 2
queries. The first one (Q1) is SELECT mysterious_function() and the
second one (Q2) is SELECT * FROM range_partitioned_table WHERE
key_column = 42. What if mysterious_function() performs DDL on
range_partitioned_table? I haven't tested this so maybe there are
things going on here that prevent trouble, but it seems like executing
Q1 can easily invalidate the plan for Q2. And then it seems like
you're basically back to the same problem.

> > > 3. The need to add *back* the fields to store the RT indexes of
> > > relations that are not looked at by ExecInitNode() traversal such as
> > > root partitioned tables and non-leaf partitions.
> >
> > I don't remember exactly why we removed those or what the benefit was,
> > so I'm not sure how big of a problem it is if we have to put them
> > back.
>
> We removed those in commit 52ed730d511b after commit f2343653f5b2
> removed redundant execution-time locking of non-leaf relations.  So we
> removed them because we realized that execution time locking is
> unnecessary given that AcquireExecutorLocks() exists and now we want
> to add them back because we'd like to get rid of
> AcquireExecutorLocks(). :-)

My bias is to believe that getting rid of AcquireExecutorLocks() is
probably the right thing to do, but that's not a strongly-held
position and I could be totally wrong about it. The thing is, though,
that AcquireExecutorLocks() is fundamentally stupid, and it's hard to
see how it can ever be any smarter. If we want to make smarter
decisions about what to lock, it seems reasonable to me to think that
the locking code needs to be closer to code that can evaluate
expressions and prune partitions and stuff like that.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Fri, Aug 16, 2024 at 12:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Aug 15, 2024 at 8:57 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > TBH, it's more of a hunch that people who are not involved in this
> > development might find the new reality, whereby the execution is not
> > racefree until ExecutorRun(), hard to reason about.
>
> I'm confused by what you mean here by "racefree". A race means
> multiple sessions are doing stuff at the same time and the result
> depends on who does what first, but the executor stuff is all
> backend-private. Heavyweight locks are not backend-private, but those
> would be taken in ExecutorStart(), not ExecutorRun(), IIUC.

Sorry, yes, I meant ExecutorStart().  A backend that wants to execute
a plan tree from a CachedPlan is in a race with other backends that
might modify tables before ExecutorStart() takes the remaining locks.
That race window is bigger when it is ExecutorStart() that will take
the locks, and I don't mean in terms of timing, but in terms of the
other code that can run between GetCachedPlan() returning a partially
valid plan and ExecutorStart() taking the remaining locks, which
varies depending on the calling module.

> > With the patch, CreateQueryDesc() and ExecutorStart() are moved to
> > PortalStart() so that QueryDescs including the PlanState trees for all
> > queries are built before any is run.  Why?  So that if ExecutorStart()
> > fails for any query in the list, we can simply throw out the QueryDesc
> > and the PlanState trees of the previous queries (NOT run them) and ask
> > plancache for a new CachedPlan for the list of queries.  We don't have
> > a way to ask plancache.c to replan only a given query in the list.
>
> I agree that moving this from PortalRun() to PortalStart() seems like
> a bad idea, especially in view of what you write below.
>
> > * There's no longer CCI() between queries in PortalRunMulti() because
> > the snapshots in each query's QueryDesc must have been adjusted to
> > reflect the correct command counter.  I've checked but can't really be
> > sure if the value in the snapshot is all anyone ever uses if they want
> > to know the current value of the command counter.
>
> I don't think anything stops somebody wanting to look at the current
> value of the command counter. I also don't think you can remove the
> CommandCounterIncrement() calls between successive queries, because
> then they won't see the effects of earlier calls. So this sounds
> broken to me.

I suppose you mean CCI between "running" (calling ExecutorRun on)
successive queries.  Then the patch is indeed broken.  If we're to
make that right, the number of CCIs for the multi-query portals will
have to double given the separation of ExecutorStart() and
ExecutorRun() phases.

> Also keep in mind that one of the queries could call a function which
> does something that bumps the command counter again. I'm not sure if
> that creates its own hazard separate from the lack of CCIs, or
> whether it's just another part of that same issue. But you can't
> assume that each query's snapshot should have a command counter value
> one more than the previous query.
>
> While this all seems bad for the partially-initialized-execution-tree
> approach, I wonder if you don't have problems here with the other
> design, too. Let's say you've got the multi-query case and there are 2
> queries. The first one (Q1) is SELECT mysterious_function() and the
> second one (Q2) is SELECT * FROM range_partitioned_table WHERE
> key_column = 42. What if mysterious_function() performs DDL on
> range_partitioned_table? I haven't tested this so maybe there are
> things going on here that prevent trouble, but it seems like executing
> Q1 can easily invalidate the plan for Q2. And then it seems like
> you're basically back to the same problem.

A rule (but not views AFAICS) can lead to the multi-query case (there
might be other ways).  I tried the following, and, yes, the plan for
the query queued by the rule is broken by the execution of that for
the 1st query:

create table foo (a int);
create table bar (a int);
create or replace function foo_trig_func () returns trigger as $$
begin drop table bar cascade; return new.*; end; $$ language plpgsql;
create trigger foo_trig before insert on foo execute function foo_trig_func();
create rule insert_foo AS ON insert TO foo do also insert into bar
values (new.*);
set plan_cache_mode to force_generic_plan ;
prepare q as insert into foo values (1);
execute q;
NOTICE:  drop cascades to rule insert_foo on table foo
ERROR:  relation with OID 16418 does not exist

The ERROR comes from trying to run (actually "initialize") the cached
plan for `insert into bar values (new.*);` which is due to the rule.

Though, it doesn't have to be a cached plan for the breakage to
happen.  You can see the same error without the prepared statement:

insert into foo values (1);
NOTICE:  drop cascades to rule insert_foo on table foo
ERROR:  relation with OID 16418 does not exist

Another example:

create or replace function foo_trig_func () returns trigger as $$
begin alter table bar add b int; return new.*; end; $$ language
plpgsql;
execute q;
ERROR:  table row type and query-specified row type do not match
DETAIL:  Query has too few columns.

insert into foo values (1);
ERROR:  table row type and query-specified row type do not match
DETAIL:  Query has too few columns.

This time the error occurs in ExecModifyTable(), so when "running" the
plan, but again the code that's throwing the error is just "lazy"
initialization of the ProjectionInfo when inserting into bar.

So it is possible for the executor to try to run a plan that has
become invalid since it was created, so...

> > > > 3. The need to add *back* the fields to store the RT indexes of
> > > > relations that are not looked at by ExecInitNode() traversal such as
> > > > root partitioned tables and non-leaf partitions.
> > >
> > > I don't remember exactly why we removed those or what the benefit was,
> > > so I'm not sure how big of a problem it is if we have to put them
> > > back.
> >
> > We removed those in commit 52ed730d511b after commit f2343653f5b2
> > removed redundant execution-time locking of non-leaf relations.  So we
> > removed them because we realized that execution time locking is
> > unnecessary given that AcquireExecutorLocks() exists and now we want
> > to add them back because we'd like to get rid of
> > AcquireExecutorLocks(). :-)
>
> My bias is to believe that getting rid of AcquireExecutorLocks() is
> probably the right thing to do, but that's not a strongly-held
> position and I could be totally wrong about it. The thing is, though,
> that AcquireExecutorLocks() is fundamentally stupid, and it's hard to
> see how it can ever be any smarter. If we want to make smarter
> decisions about what to lock, it seems reasonable to me to think that
> the locking code needs to be closer to code that can evaluate
> expressions and prune partitions and stuff like that.

One perhaps crazy idea [1]:

What if we remove AcquireExecutorLocks() and move the responsibility
of taking the remaining necessary locks into the executor (those on
any inheritance children that are added during planning and thus not
accounted for by AcquirePlannerLocks()), like the patch already does,
but don't make it also check if the plan has become invalid, which it
can't do anyway unless it's from a CachedPlan.  That means we instead
let the executor throw any errors that occur when trying to either
initialize the plan because of the changes that have occurred to the
objects referenced in the plan, like what is happening in the above
example.  If that case is going to be rare anyway, why spend energy on
checking the validity and replan, especially if that's not an easy
thing to do as we're finding out.  In the above example, we could say
that it's a user error to create a rule like that, so it should not
happen in practice, but when it does, the executor seems to deal with
it correctly by refusing to execute a broken plan.  Perhaps it's more
worthwhile to make the executor behave correctly in face of plan
invalidation than teach the rest of the system to deal with the
executor throwing its hands up when it runs into an invalid plan?
Again, I think this may be a crazy line of thinking but just wanted to
get it out there.

--
Thanks, Amit Langote

[1] I recall Michael Paquier mentioning something like this to me once
when I was describing this patch and thread to him.



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Fri, Aug 16, 2024 at 8:36 AM Amit Langote <amitlangote09@gmail.com> wrote:
> So it is possible for the executor to try to run a plan that has
> become invalid since it was created, so...

I'm not sure what the "so what" here is.

> One perhaps crazy idea [1]:
>
> What if we remove AcquireExecutorLocks() and move the responsibility
> of taking the remaining necessary locks into the executor (those on
> any inheritance children that are added during planning and thus not
> accounted for by AcquirePlannerLocks()), like the patch already does,
> but don't make it also check if the plan has become invalid, which it
> can't do anyway unless it's from a CachedPlan.  That means we instead
> let the executor throw any errors that occur when trying to either
> initialize the plan because of the changes that have occurred to the
> objects referenced in the plan, like what is happening in the above
> example.  If that case is going to be rare anyway, why spend energy on
> checking the validity and replan, especially if that's not an easy
> thing to do as we're finding out.  In the above example, we could say
> that it's a user error to create a rule like that, so it should not
> happen in practice, but when it does, the executor seems to deal with
> it correctly by refusing to execute a broken plan.  Perhaps it's more
> worthwhile to make the executor behave correctly in face of plan
> invalidation than teach the rest of the system to deal with the
> executor throwing its hands up when it runs into an invalid plan?
> Again, I think this may be a crazy line of thinking but just wanted to
> get it out there.

I don't know whether this is crazy or not. I think there are two
issues. One, the set of checks that we have right now might not be
complete, and we might just not have realized that because it happens
infrequently enough that we haven't found all the bugs. If that's so,
then a change like this could be a good thing, because it might force
us to fix stuff we should be fixing anyway. I have a feeling that some
of the checks you hit there were added as bug fixes long after the
code was written originally, so my confidence that we don't have more
bugs isn't especially high.

And two, it matters a lot how frequent the errors will be in practice.
I think we normally try to replan rather than let a stale plan be used
because we want to not fail, because users don't like failure. If the
design you propose here would make failures more (or less) frequent,
then that's a problem (or awesome).

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 16, 2024 at 8:36 AM Amit Langote <amitlangote09@gmail.com> wrote:
>> So it is possible for the executor to try to run a plan that has
>> become invalid since it was created, so...

> I'm not sure what the "so what" here is.

The fact that there are holes in our protections against that doesn't
make it a good idea to walk away from the protections.  That path
leads to crashes and data corruption and unhappy users.

What the examples here are showing is that AcquireExecutorLocks
is incomplete because it only provides defenses against DDL
initiated by other sessions, not by our own session.  We have
CheckTableNotInUse but I'm not sure if it could be applied here.
We certainly aren't calling that in anywhere near as systematic
a way as we have for acquiring locks.

Maybe we should rethink the principle that a session's locks
never conflict against itself, although I fear that might be
a nasty can of worms.

Could it work to do CheckTableNotInUse when acquiring an
exclusive table lock?  I don't doubt that we'd have to fix some
code paths, but if the damage isn't extensive then that
might offer a more nearly bulletproof approach.

            regards, tom lane



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Mon, Aug 19, 2024 at 12:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What the examples here are showing is that AcquireExecutorLocks
> is incomplete because it only provides defenses against DDL
> initiated by other sessions, not by our own session.  We have
> CheckTableNotInUse but I'm not sure if it could be applied here.
> We certainly aren't calling that in anywhere near as systematic
> a way as we have for acquiring locks.
>
> Maybe we should rethink the principle that a session's locks
> never conflict against itself, although I fear that might be
> a nasty can of worms.

It might not be that bad. It could replace the CheckTableNotInUse()
protections that we have today but maybe cover more cases, and it
could do so without needing any changes to the shared lock manager.
Say every time you start a query you give that query an ID number, and
all locks taken by that query are tagged with that ID number in the
local lock table, and maybe some flags indicating why the lock was
taken. When a new lock acquisition comes along you can say "oh, this
lock was previously taken so that we could do thus-and-so" and then
use that to fail with the appropriate error message. That seems like
it might be more powerful than the refcnt check within
CheckTableNotInUse().

But that seems somewhat incidental to what this thread is about. IIUC,
Amit's original design involved having the plan cache call some new
executor function to do partition pruning before lock acquisition, and
then passing that data structure around, including back to the
executor, so that we didn't repeat the pruning we already did, which
would be a bad thing to do not only because it would incur CPU cost
but also because really bad things would happen if we got a different
answer the second time. IIUC, you didn't think that was going to work
out nicely, and suggested instead moving the pruning+locking to
ExecutorStart() time. But now Amit is finding problems with that
approach, because by the time we reach PortalRun() for the
PORTAL_MULTI_QUERY case, it's too late to replan, because we can't ask
the plancache to replan just one query from the list; and if we try to
fix that by moving ExecutorStart() to PortalStart(), then there are
other problems. Do you have a view on what the way forward might be?

This thread has gotten a tad depressing, honestly. All of the opinions
about what we ought to do seem to be based on the firm conviction that
X or Y or Z will not work, rather than on the confidence that A or B
or C will work. Yet I'm inclined to believe this problem is solvable.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> But that seems somewhat incidental to what this thread is about.

Perhaps.  But if we're running into issues related to that, it might
be good to set aside the long-term goal for a bit and come up with
a cleaner answer for intra-session locking.  That could allow the
pruning problem to be solved more cleanly in turn, and it'd be
an improvement even if not.

> Do you have a view on what the way forward might be?

I'm fresh out of ideas at the moment, other than having a hope that
divide-and-conquer (ie, solving subproblems first) might pay off.

> This thread has gotten a tad depressing, honestly. All of the opinions
> about what we ought to do seem to be based on the firm conviction that
> X or Y or Z will not work, rather than on the confidence that A or B
> or C will work. Yet I'm inclined to believe this problem is solvable.

Yeah.  We are working in an extremely not-green field here, which
means it's a lot easier to see pre-existing reasons why X will not
work than to have confidence that it will work.  But hey, if this
were easy then we'd have done it already.

            regards, tom lane



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Mon, Aug 19, 2024 at 1:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > But that seems somewhat incidental to what this thread is about.
>
> Perhaps.  But if we're running into issues related to that, it might
> be good to set aside the long-term goal for a bit and come up with
> a cleaner answer for intra-session locking.  That could allow the
> pruning problem to be solved more cleanly in turn, and it'd be
> an improvement even if not.

Maybe, but the pieces aren't quite coming together for me. Solving
this would mean that if we execute a stale plan, we'd be more likely
to get a good error and less likely to get a bad, nasty-looking
internal error, or a crash. That's good on its own terms, but we don't
really want user queries to produce errors at all, so I don't think
we'd feel any more free to rearrange the order of operations than we
do today.

> > Do you have a view on what the way forward might be?
>
> I'm fresh out of ideas at the moment, other than having a hope that
> divide-and-conquer (ie, solving subproblems first) might pay off.

Fair enough, but why do you think that the original approach of
creating a data structure from within the plan cache mechanism
(probably via a call into some new executor entrypoint) and then
feeding that through to ExecutorRun() time can't work? Is it possible
you latched onto some non-optimal decisions that the early versions of
the patch made, rather than there being a fundamental problem with the
concept?

I actually thought the do-it-at-executorstart-time approach sounded
pretty good, even though we might have to abandon planstate tree
initialization partway through, right up until Amit started talking
about moving ExecutorStart() from PortalRun() to PortalStart(), which
I have a feeling is going to create a bigger problem than we can
solve. I think if we want to save that approach, we should try to
figure out if we can teach the plancache to replan one query from a
list without replanning the others, which seems like it might allow us
to keep the order of major operations unchanged. Otherwise, it makes
sense to me to have another go at the other approach, at least to make
sure we understand clearly why it can't work.

> Yeah.  We are working in an extremely not-green field here, which
> means it's a lot easier to see pre-existing reasons why X will not
> work than to have confidence that it will work.  But hey, if this
> were easy then we'd have done it already.

Yeah, true.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Aug 20, 2024 at 1:39 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 16, 2024 at 8:36 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > So it is possible for the executor to try to run a plan that has
> > become invalid since it was created, so...
>
> I'm not sure what the "so what" here is.

I meant that if the executor has to deal with broken plans anyway, we
might as well lean into that fact by choosing not to handle only the
cached plan case in a certain way.  Yes, I understand that that's not
a good justification.

> > One perhaps crazy idea [1]:
> >
> > What if we remove AcquireExecutorLocks() and move the responsibility
> > of taking the remaining necessary locks into the executor (those on
> > any inheritance children that are added during planning and thus not
> > accounted for by AcquirePlannerLocks()), like the patch already does,
> > but don't make it also check if the plan has become invalid, which it
> > can't do anyway unless it's from a CachedPlan.  That means we instead
> > let the executor throw any errors that occur when trying to either
> > initialize the plan because of the changes that have occurred to the
> > objects referenced in the plan, like what is happening in the above
> > example.  If that case is going to be rare anyway, why spend energy on
> > checking the validity and replan, especially if that's not an easy
> > thing to do as we're finding out.  In the above example, we could say
> > that it's a user error to create a rule like that, so it should not
> > happen in practice, but when it does, the executor seems to deal with
> > it correctly by refusing to execute a broken plan.  Perhaps it's more
> > worthwhile to make the executor behave correctly in face of plan
> > invalidation than teach the rest of the system to deal with the
> > executor throwing its hands up when it runs into an invalid plan?
> > Again, I think this may be a crazy line of thinking but just wanted to
> > get it out there.
>
> I don't know whether this is crazy or not. I think there are two
> issues. One, the set of checks that we have right now might not be
> complete, and we might just not have realized that because it happens
> infrequently enough that we haven't found all the bugs. If that's so,
> then a change like this could be a good thing, because it might force
> us to fix stuff we should be fixing anyway. I have a feeling that some
> of the checks you hit there were added as bug fixes long after the
> code was written originally, so my confidence that we don't have more
> bugs isn't especially high.

This makes sense.

> And two, it matters a lot how frequent the errors will be in practice.
> I think we normally try to replan rather than let a stale plan be used
> because we want to not fail, because users don't like failure. If the
> design you propose here would make failures more (or less) frequent,
> then that's a problem (or awesome).

I think we'd modify plancache.c to postpone the locking of only
prunable relations (i.e., partitions), so we're looking at only a
handful of concurrent modifications that are going to cause execution
errors.  That's because we disallow many DDL modifications of
partitions unless they are done via recursion from the parent, so the
space of errors in practice would be smaller compared to if we were to
postpone *all* cached plan locks to ExecInitNode() time.  DROP INDEX
a_partition_only_index comes to mind as something that might cause an
error.  I've not tested if other partition-only constraints can cause
unsafe behaviors.
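
Something along these lines (a hypothetical sketch; the names are made
up):

create table measurements (tenant int, val int) partition by list (tenant);
create table measurements_1 partition of measurements for values in (1);
create index measurements_1_val_idx on measurements_1 (val);  -- partition-only index
set plan_cache_mode to force_generic_plan;
prepare q as select * from measurements where tenant = $1 and val = 42;
execute q (1);   -- the generic plan's subplan for measurements_1 may be
                 -- an Index Scan on that partition-only index
-- concurrently, another session: drop index measurements_1_val_idx;
execute q (1);   -- with the partition's lock deferred to the executor
                 -- and no validity recheck, initializing that Index Scan
                 -- node would presumably fail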

Perhaps, we can add the check for CachedPlan.is_valid after every
table_open() and index_open() in the executor that takes a lock or at
all the places we discussed previously and throw the error (say:
"cached plan is no longer valid") if it's false.  That's better than
running into and throwing some random error by soldiering ahead with
its initialization / execution, but still a loss in terms of user
experience because we're adding a new failure mode, however rare.

--
Thanks, Amit Langote



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Aug 20, 2024 at 3:21 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Aug 19, 2024 at 1:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> > > But that seems somewhat incidental to what this thread is about.
> >
> > Perhaps.  But if we're running into issues related to that, it might
> > be good to set aside the long-term goal for a bit and come up with
> > a cleaner answer for intra-session locking.  That could allow the
> > pruning problem to be solved more cleanly in turn, and it'd be
> > an improvement even if not.
>
> Maybe, but the pieces aren't quite coming together for me. Solving
> this would mean that if we execute a stale plan, we'd be more likely
> to get a good error and less likely to get a bad, nasty-looking
> internal error, or a crash. That's good on its own terms, but we don't
> really want user queries to produce errors at all, so I don't think
> we'd feel any more free to rearrange the order of operations than we
> do today.

Yeah, it's unclear whether executing a potentially stale plan is an
acceptable tradeoff compared to replanning, especially if it occurs
rarely. Personally, I'm inclined to think it is.

> > > Do you have a view on what the way forward might be?
> >
> > I'm fresh out of ideas at the moment, other than having a hope that
> > divide-and-conquer (ie, solving subproblems first) might pay off.
>
> Fair enough, but why do you think that the original approach of
> creating a data structure from within the plan cache mechanism
> (probably via a call into some new executor entrypoint) and then
> feeding that through to ExecutorRun() time can't work?

That would be ExecutorStart().  The data structure need not be
referenced after ExecInitNode().

> Is it possible
> you latched onto some non-optimal decisions that the early versions of
> the patch made, rather than there being a fundamental problem with the
> concept?
>
> I actually thought the do-it-at-executorstart-time approach sounded
> pretty good, even though we might have to abandon planstate tree
> initialization partway through, right up until Amit started talking
> about moving ExecutorStart() from PortalRun() to PortalStart(), which
> I have a feeling is going to create a bigger problem than we can
> solve. I think if we want to save that approach, we should try to
> figure out if we can teach the plancache to replan one query from a
> list without replanning the others, which seems like it might allow us
> to keep the order of major operations unchanged.  Otherwise, it makes
> sense to me to have another go at the other approach, at least to make
> sure we understand clearly why it can't work.

+1

--
Thanks, Amit Langote



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Tue, Aug 20, 2024 at 9:00 AM Amit Langote <amitlangote09@gmail.com> wrote:
> I think we'd modify plancache.c to postpone the locking of only
> prunable relations (i.e., partitions), so we're looking at only a
> handful of concurrent modifications that are going to cause execution
> errors.  That's because we disallow many DDL modifications of
> partitions unless they are done via recursion from the parent, so the
> space of errors in practice would be smaller compared to if we were to
> postpone *all* cached plan locks to ExecInitNode() time.  DROP INDEX
> a_partition_only_index comes to mind as something that might cause an
> error.  I've not tested if other partition-only constraints can cause
> unsafe behaviors.

This seems like a valid point to some extent, but in other contexts
we've had discussions about how we don't actually guarantee all that
much uniformity between a partitioned table and its partitions, and
it's been questioned whether we made the right decisions there. So I'm
not entirely sure that the surface area for problems here will be as
narrow as you're hoping -- I think we'd need to go through all of the
ALTER TABLE variants and think it through. But maybe the problems
aren't that bad.

It does seem like constraints can change the plan. Imagine the
partition had a CHECK(false) constraint before and now doesn't, or
something.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Tue, Aug 20, 2024 at 11:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 20, 2024 at 9:00 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > I think we'd modify plancache.c to postpone the locking of only
> > prunable relations (i.e., partitions), so we're looking at only a
> > handful of concurrent modifications that are going to cause execution
> > errors.  That's because we disallow many DDL modifications of
> > partitions unless they are done via recursion from the parent, so the
> > space of errors in practice would be smaller compared to if we were to
> > postpone *all* cached plan locks to ExecInitNode() time.  DROP INDEX
> > a_partion_only_index comes to mind as something that might cause an
> > error.  I've not tested if other partition-only constraints can cause
> > unsafe behaviors.
>
> This seems like a valid point to some extent, but in other contexts
> we've had discussions about how we don't actually guarantee all that
> much uniformity between a partitioned table and its partitions, and
> it's been questioned whether we made the right decisions there. So I'm
> not entirely sure that the surface area for problems here will be as
> narrow as you're hoping -- I think we'd need to go through all of the
> ALTER TABLE variants and think it through. But maybe the problems
> aren't that bad.

Many changeable properties that are reflected in the RelationData of a
partition after getting the lock on it seem to cause no issues as long
as the executor code only looks at RelationData, which is true for
most Scan nodes.  It also seems true for ModifyTable which looks into
RelationData for relation properties relevant to inserts and deletes.

The two things that don't cope are:

* Index Scan nodes with concurrent DROP INDEX of partition-only indexes.

* Concurrent DROP CONSTRAINT of partition-only CHECK and NOT NULL
constraints can lead to incorrect results, as I describe below.

> It does seem like constraints can change the plan. Imagine the
> partition had a CHECK(false) constraint before and now doesn't, or
> something.

Yeah, if the CHECK constraint gets dropped concurrently, any new rows
that got added after that will not be returned by executing a stale
cached plan, because the plan would have been created based on the
assumption that such rows shouldn't be there due to the CHECK
constraint.  We currently don't explicitly check that the constraints
that were used during planning still exist before executing the plan.
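
A sketch of that hazard (names are made up, and whether t1 is actually
excluded at plan time depends on constraint_exclusion):

create table t (part int, val int) partition by list (part);
create table t1 partition of t for values in (1);
alter table t1 add constraint t1_val check (val > 100);
set plan_cache_mode to force_generic_plan;
prepare q as select * from t where val < 10;
execute q;   -- the planner may exclude t1 because its CHECK constraint
             -- contradicts the qual
-- concurrently: alter table t1 drop constraint t1_val;
--               insert into t values (1, 5);
execute q;   -- a stale plan that still excludes t1 would silently miss
             -- the new row unless it gets invalidated and replanned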

Overall, I'm starting to feel less enthused about the idea of throwing
an error in the executor due to the known and unknown hazards of trying to
execute a stale plan.  Even if we made a note in the docs of such
hazards, any users who run into these rare errors are likely to head
to -bugs or -hackers anyway.

Tom said we should perhaps look at the hazards caused by intra-session
locking, but we'd still be left with the hazards of missing indexes and
constraints, AFAICS, due to DROPs from other sessions.

So, the options:

* The replanning aspect of the lock-in-the-executor design would be
simpler if a CachedPlan contained the plan for a single query rather
than a list of queries, as previously mentioned. This is particularly
due to the requirements of the PORTAL_MULTI_QUERY case. However, this
option might be impractical.

* Polish the patch for the old design of doing the initial pruning
before AcquireExecutorLocks() and focus on hashing out any bugs and
issues of that design.

--
Thanks, Amit Langote



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Wed, Aug 21, 2024 at 8:45 AM Amit Langote <amitlangote09@gmail.com> wrote:
> * The replanning aspect of the lock-in-the-executor design would be
> simpler if a CachedPlan contained the plan for a single query rather
> than a list of queries, as previously mentioned. This is particularly
> due to the requirements of the PORTAL_MULTI_QUERY case. However, this
> option might be impractical.

It might be, but maybe it would be worth a try? I mean,
GetCachedPlan() seems to just call pg_plan_queries() which just loops
over the list of query trees and does the same thing for each one. If
we wanted to replan a single query, why couldn't we do
fake_querytree_list = list_make1(list_nth(querytree_list, n)) and then
call pg_plan_queries(fake_querytree_list)? Or something equivalent to
that. We could have a new GetCachedSinglePlan(cplan, n) to do this.

> * Polish the patch for the old design of doing the initial pruning
> before AcquireExecutorLocks() and focus on hashing out any bugs and
> issues of that design.

That's also an option. It probably has issues too, but I don't know
what they are exactly.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Wed, Aug 21, 2024 at 10:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 21, 2024 at 8:45 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > * The replanning aspect of the lock-in-the-executor design would be
> > simpler if a CachedPlan contained the plan for a single query rather
> > than a list of queries, as previously mentioned. This is particularly
> > due to the requirements of the PORTAL_MULTI_QUERY case. However, this
> > option might be impractical.
>
> It might be, but maybe it would be worth a try? I mean,
> GetCachedPlan() seems to just call pg_plan_queries() which just loops
> over the list of query trees and does the same thing for each one. If
> we wanted to replan a single query, why couldn't we do
> fake_querytree_list = list_make1(list_nth(querytree_list, n)) and then
> call pg_plan_queries(fake_querytree_list)? Or something equivalent to
> that. We could have a new GetCachedSinglePlan(cplan, n) to do this.

I've been hacking to prototype this, and it's showing promise. It
helps make the replan loop at the call sites that start the executor
with an invalidatable plan more localized and less prone to
action-at-a-distance issues. However, the interface and contract of
the new function in my prototype are pretty specialized for the replan
loop in this context—meaning it's not as general-purpose as
GetCachedPlan(). Essentially, what you get when you call it is a
'throwaway' CachedPlan containing only the plan for the query that
failed during ExecutorStart(), not a plan integrated into the original
CachedPlanSource's stmt_list. A call site entering the replan loop
will retry the execution with that throwaway plan, release it once
done, and resume looping over the plans in the original list. The
invalid plan that remains in the original list will be discarded and
replanned in the next call to GetCachedPlan() using the same
CachedPlanSource. While that may sound undesirable, I'm inclined to
think it's not something that needs optimization, given that we're
expecting this code path to be taken rarely.

I'll post a version of a revamped locks-in-the-executor patch set
using the above function after debugging some more.

--
Thanks, Amit Langote



Re: generic plans and "initial" pruning

From
Junwang Zhao
Date:
Hi,

On Thu, Aug 29, 2024 at 9:34 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Fri, Aug 23, 2024 at 9:48 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Wed, Aug 21, 2024 at 10:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > On Wed, Aug 21, 2024 at 8:45 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > > > * The replanning aspect of the lock-in-the-executor design would be
> > > > simpler if a CachedPlan contained the plan for a single query rather
> > > > than a list of queries, as previously mentioned. This is particularly
> > > > due to the requirements of the PORTAL_MULTI_QUERY case. However, this
> > > > option might be impractical.
> > >
> > > It might be, but maybe it would be worth a try? I mean,
> > > GetCachedPlan() seems to just call pg_plan_queries() which just loops
> > > over the list of query trees and does the same thing for each one. If
> > > we wanted to replan a single query, why couldn't we do
> > > fake_querytree_list = list_make1(list_nth(querytree_list, n)) and then
> > > call pg_plan_queries(fake_querytree_list)? Or something equivalent to
> > > that. We could have a new GetCachedSinglePlan(cplan, n) to do this.
> >
> > I've been hacking to prototype this, and it's showing promise. It
> > helps make the replan loop at the call sites that start the executor
> > with an invalidatable plan more localized and less prone to
> > action-at-a-distance issues. However, the interface and contract of
> > the new function in my prototype are pretty specialized for the replan
> > loop in this context—meaning it's not as general-purpose as
> > GetCachedPlan(). Essentially, what you get when you call it is a
> > 'throwaway' CachedPlan containing only the plan for the query that
> > failed during ExecutorStart(), not a plan integrated into the original
> > CachedPlanSource's stmt_list. A call site entering the replan loop
> > will retry the execution with that throwaway plan, release it once
> > done, and resume looping over the plans in the original list. The
> > invalid plan that remains in the original list will be discarded and
> > replanned in the next call to GetCachedPlan() using the same
> > CachedPlanSource. While that may sound undesirable, I'm inclined to
> > think it's not something that needs optimization, given that we're
> > expecting this code path to be taken rarely.
> >
> > I'll post a version of a revamped locks-in-the-executor patch set
> > using the above function after debugging some more.
>
> Here it is.
>
> 0001 implements changes to defer the locking of runtime-prunable
> relations to the executor.  The new design introduces a bitmapset
> field in PlannedStmt to distinguish at runtime between relations that
> are prunable whose locking can be deferred until ExecInitNode() and
> those that are not and must be locked in advance.  The set of prunable
> relations can be constructed by looking at all the PartitionPruneInfos
> in the plan and checking which are subject to "initial" pruning steps.
> The set of unprunable relations is obtained by subtracting those from
> the set of all RT indexes.  This design gets rid of one annoying
> aspect of the old design which was the need to add specialized fields
> to store the RT indexes of partitioned relations that are not
> otherwise referenced in the plan tree. That was necessary because in
> the old design, I had removed the function AcquireExecutorLocks()
> altogether to defer the locking of all child relations to execution.
> In the new design such relations are still locked by
> AcquireExecutorLocks().
>
> 0002 is the old patch to make ExecEndNode() robust against partially
> initialized PlanState nodes by adding NULL checks.
>
> 0003 is the patch to add changes to deal with the CachedPlan becoming
> invalid before the deferred locks on prunable relations are taken.
> I've moved the replan loop into a new wrapper-over-ExecutorStart()
> function instead of having the same logic at multiple sites.  The
> replan logic uses the GetSingleCachedPlan() described in the quoted
> text.  The callers of the new ExecutorStart()-wrapper, which I've
> dubbed ExecutorStartExt(), need to pass the CachedPlanSource and a
> query_index, which is the index of the query being executed in the
> list CachedPlanSource.query_list.  They are needed by
> GetSingleCachedPlan().  The changes outside the executor are pretty
> minimal in this design and all the difficulties of having to loop back
> to GetCachedPlan() are now gone.  I like how this turned out.
>
> One idea that I think might be worth trying to reduce the footprint of
> 0003 is to try to lock the prunable relations in a step of InitPlan()
> separate from ExecInitNode(), which can be implemented by doing the
> initial runtime pruning in that separate step.  That way, we'll have
> all the necessary locks before calling ExecInitNode() and so we don't
> need to sprinkle the CachedPlanStillValid() checks all over the place
> and worry about missed checks and dealing with partially initialized
> PlanState trees.
>
> --
> Thanks, Amit Langote

@@ -1241,7 +1244,7 @@ GetCachedPlan(CachedPlanSource *plansource,
ParamListInfo boundParams,
  if (customplan)
  {
  /* Build a custom plan */
- plan = BuildCachedPlan(plansource, qlist, boundParams, queryEnv);
+ plan = BuildCachedPlan(plansource, qlist, boundParams, queryEnv, true);

Is the *true* here a typo? Seems it should be *false* for custom plan?

--
Regards
Junwang Zhao



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Sat, Aug 31, 2024 at 9:30 PM Junwang Zhao <zhjwpku@gmail.com> wrote:
> @@ -1241,7 +1244,7 @@ GetCachedPlan(CachedPlanSource *plansource,
> ParamListInfo boundParams,
>   if (customplan)
>   {
>   /* Build a custom plan */
> - plan = BuildCachedPlan(plansource, qlist, boundParams, queryEnv);
> + plan = BuildCachedPlan(plansource, qlist, boundParams, queryEnv, true);
>
> Is the *true* here a typo? Seems it should be *false* for custom plan?

That's correct, thanks for catching that.  Will fix.

--
Thanks, Amit Langote



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
Hi Amit,

This is not a full review (sorry!) but here are a few comments.

In general, I don't have a problem with this direction. I thought
Tom's previous proposal of abandoning ExecInitNode() in medias res if
we discover that we need to replan was doable and I still think that,
but ISTM that this approach needs to touch less code, because
abandoning ExecInitNode() partway through means we could have leftover
state to clean up in any node in the PlanState tree, and as we've
discussed, ExecEndNode() isn't necessarily prepared to clean up a
PlanState tree that was only partially processed by ExecInitNode(). As
far as I can see in the time I've spent looking at this today, 0001
looks pretty unobjectionable (with some exceptions that I've noted
below). I also think 0003 looks pretty safe. It seems like partition
pruning moves backward across a pretty modest amount of code that does
pretty well-defined things. Basically, initialization-time pruning now
happens before other types of node initialization, and before setting
up row marks. I do however find the changes in 0002 to be less
obviously correct and less obviously safe; see below for some notes
about that.

In 0001, the name root_parent_relids doesn't seem very clear to me,
and neither does the explanation of what it does. You say
"'root_parent_relids' identifies the relation to which both the parent
plan and the PartitionPruneInfo given by 'part_prune_index' belong."
But it's a set, so what does it mean to identify "the" relation? It's
a set of relations, not just one. And why does the name include the
word "root"? It's neither the PlannerGlobal object, which we often
call root, nor is it the root of the partitioning hierarchy. To me, it
looks like it's just the set of relids that we can potentially prune.
I don't see why this isn't just called "relids", like the field from
which it's copied:

+       pruneinfo->root_parent_relids = parentrel->relids;

It just doesn't seem very root-y or very parent-y.

-       node->part_prune_info = partpruneinfo;
+

Extra blank line.

In 0002, the handling of ExprContexts seems a little bit hard to
understand. Sometimes we're using the PlanState's ExprContext, and
sometimes we're using a separate context owned by the
PartitionedRelPruningData's context, and it's not exactly clear why
that is or what the consequences are. Likewise I wouldn't mind some
more comments or explanation in the commit message of the changes in
this patch related to EState objects. I can't help wondering if the
changes here could have either semantic implications (like expression
evaluation can produce different results than before) or performance
implications (because we create objects that we didn't previously
create). As noted above, this is really my only design-level concern
about 0001-0003.

Typo: partrtitioned

Regrettably, I have not looked seriously at 0004 and 0005, so I can't
comment on those.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
Robert,

On Fri, Oct 11, 2024 at 5:15 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Hi Amit,
>
> This is not a full review (sorry!) but here are a few comments.

Thank you for taking a look.

> In general, I don't have a problem with this direction. I thought
> Tom's previous proposal of abandoning ExecInitNode() in medias res if
> we discover that we need to replan was doable and I still think that,
> but ISTM that this approach needs to touch less code, because
> abandoning ExecInitNode() partway through means we could have leftover
> state to clean up in any node in the PlanState tree, and as we've
> discussed, ExecEndNode() isn't necessarily prepared to clean up a
> PlanState tree that was only partially processed by ExecInitNode().

I will say that I feel more comfortable committing and being
responsible for the refactoring I'm proposing in 0001-0003 than the
changes required to take locks during ExecInitNode(), as seen in the
patches up to version v52.

> As
> far as I can see in the time I've spent looking at this today, 0001
> looks pretty unobjectionable (with some exceptions that I've noted
> below). I also think 0003 looks pretty safe. It seems like partition
> pruning moves backward across a pretty modest amount of code that does
> pretty well-defined things. Basically, initialization-time pruning now
> happens before other types of node initialization, and before setting
> up row marks. I do however find the changes in 0002 to be less
> obviously correct and less obviously safe; see below for some notes
> about that.
>
> In 0001, the name root_parent_relids doesn't seem very clear to me,
> and neither does the explanation of what it does. You say
> "'root_parent_relids' identifies the relation to which both the parent
> plan and the PartitionPruneInfo given by 'part_prune_index' belong."
> But it's a set, so what does it mean to identify "the" relation? It's
> a set of relations, not just one.

The intention is to ensure that the bitmapset in PartitionPruneInfo
corresponds to the apprelids bitmapset in the Append or MergeAppend
node that owns the PartitionPruneInfo. Essentially, root_parent_relids
is used to cross-check that both sets align, ensuring that the pruning
logic applies to the same relations as the parent plan.
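
For illustration, the cross-check amounts to something like the
following (a simplified sketch; the variable names are assumed here and
differ from the actual patch):

    pruneinfo = list_nth_node(PartitionPruneInfo,
                              plannedstmt->partPruneInfos,
                              part_prune_index);
    if (!bms_equal(parent_relids, pruneinfo->root_parent_relids))
        elog(ERROR, "mismatching PartitionPruneInfo at part_prune_index %d",
             part_prune_index);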

> And why does the name include the
> word "root"? It's neither the PlannerGlobal object, which we often
> call root, nor is it the root of the partitioning hierarchy. To me, it
> looks like it's just the set of relids that we can potentially prune.
> I don't see why this isn't just called "relids", like the field from
> which it's copied:
>
> +       pruneinfo->root_parent_relids = parentrel->relids;
>
> It just doesn't seem very root-y or very parent-y.

Maybe just "relids" suffices with a comment updated like this:

 * relids               RelOptInfo.relids of the parent plan node (e.g. Append
 *                      or MergeAppend) to which his PartitionPruneInfo node
 *                      belongs. Used to ensure that the pruning logic matches
 *                      the parent plan's apprelids.

> -       node->part_prune_info = partpruneinfo;
> +
>
> Extra blank line.

Fixed.

> In 0002, the handling of ExprContexts seems a little bit hard to
> understand. Sometimes we're using the PlanState's ExprContext, and
> sometimes we're using a separate context owned by the
> PartitionedRelPruningData's context, and it's not exactly clear why
> that is or what the consequences are. Likewise I wouldn't mind some
> more comments or explanation in the commit message of the changes in
> this patch related to EState objects. I can't help wondering if the
> changes here could have either semantic implications (like expression
> evaluation can produce different results than before) or performance
> implications (because we create objects that we didn't previously
> create).

I have taken another look at whether there's any real need to use
separate ExprContexts for initial and runtime pruning and ISTM there
isn't, so we can make "exec" pruning use the same ExprContext as what
"init" would have used.  There *is* a difference however in how we
initializing the partition key expressions for initial and runtime
pruning, but it's not problematic to use the same ExprContext.
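
For reference, the evaluation itself ends up the same either way;
roughly (simplified from partkey_datum_from_expr(), with field names
assumed):

    /* Evaluate the non-Const partition key expression using whichever
     * ExprContext the pruning context carries. */
    exprstate = context->exprstates[stateidx];
    *value = ExecEvalExprSwitchContext(exprstate, context->exprcontext,
                                       isnull);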

I'll update the commentary a bit more.

> Typo: partrtitioned

Fixed.

> Regrettably, I have not looked seriously at 0004 and 0005, so I can't
> comment on those.

Ok, I'm updating 0005 to change how the CachedPlan is handled when it
becomes invalid during InitPlan(). Currently (v56), a separate
transient CachedPlan is created for the query being initialized when
invalidation occurs. However, it seems better to update the original
CachedPlan in place to avoid extra bookkeeping for transient plans—an
approach Robert suggested in an off-list discussion.

Will post a new version next week.

--
Thanks, Amit Langote



Re: generic plans and "initial" pruning

From
Robert Haas
Date:
On Fri, Oct 11, 2024 at 3:30 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Maybe just "relids" suffices with a comment updated like this:
>
>  * relids               RelOptInfo.relids of the parent plan node (e.g. Append
>  *                      or MergeAppend) to which his PartitionPruneInfo node
>  *                      belongs. Used to ensure that the pruning logic matches
>  *                      the parent plan's apprelids.

LGTM.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: generic plans and "initial" pruning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Oct 11, 2024 at 3:30 AM Amit Langote <amitlangote09@gmail.com> wrote:
>> Maybe just "relids" suffices with a comment updated like this:
>>
>> * relids               RelOptInfo.relids of the parent plan node (e.g. Append
>> *                      or MergeAppend) to which his PartitionPruneInfo node
>> *                      belongs. Used to ensure that the pruning logic matches
>> *                      the parent plan's apprelids.

> LGTM.

"his" -> "this", surely?

            regards, tom lane



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
Hi Tomas,

On Mon, Dec 2, 2024 at 3:36 AM Tomas Vondra <tomas@vondra.me> wrote:
> Hi,
>
> I took a look at this patch, mostly to familiarize myself with the
> pruning etc. I have a bunch of comments, but all of that is minor,
> perhaps even nitpicking - with prior feedback from David, Tom and
> Robert, I can't really compete with that.

Thanks for looking at this.  These are helpful.

> FWIW the patch needs a rebase, there's a minor bitrot - but it was
> simply enough to fix for a review / testing.
>
>
> 0001
> ----
>
> 1) But if we don't expect this error to actually happen, do we really
> need to make it ereport()? Maybe it should be plain elog(). I mean, it's
> "can't happen" and thus doesn't need translations etc.
>
>     if (!bms_equal(relids, pruneinfo->relids))
>         ereport(ERROR,
>                 errcode(ERRCODE_INTERNAL_ERROR),
>                 errmsg_internal("mismatching PartitionPruneInfo found at
> part_prune_index %d",
>                                 part_prune_index),
>                 errdetail_internal("plan node relids %s, pruneinfo
> relids %s",
>                                    bmsToString(relids),
>                                    bmsToString(pruneinfo->relids)));

I'm fine with elog() here even if it causes the message to be longer:

elog(ERROR, "mismatching PartitionPruneInfo found at part_prune_index
%d (plan node relids %s, pruneinfo relids %s)

> Perhaps it should even be an assert?

I am not sure about that.  Having a message handy might be good if a
user ends up hitting this case for whatever reason, like trying to run
a corrupted plan.

> 2) unnecessary newline added to execPartition.h

Perhaps you meant "removed".  Fixed.

> 3) this comment in EState doesn't seem very helpful
>
>     List       *es_part_prune_infos;    /* PlannedStmt.partPruneInfos */

Agreed, fixed to be like the comment for es_rteperminfos:

List       *es_part_prune_infos;    /* List of PartitionPruneInfo */

> 5) PlannerGlobal
>
>     /* List of PartitionPruneInfo contained in the plan */
>     List       *partPruneInfos;
>
> Why does this say "contained in the plan" unlike the other fields? Is
> there some sort of difference? I'm not saying it's wrong.

Ok, maybe the following is a bit more helpful and like the comment for
other fields:

    /* "flat" list of PartitionPruneInfos */
    List       *partPruneInfos;

> 0002
> ----
>
> 1) Isn't it weird/undesirable partkey_datum_from_expr() loses some of
> the asserts? Would the assert be incorrect in the new implementation, or
> are we removing it simply because we happen to not have one of the fields?

The former -- the asserts would be incorrect in the new implementation
-- because in the new implementation a standalone ExprContext is used
that is independent of the parent PlanState (when available) for both
types of runtime pruning.

The old asserts, particularly the second one, weren't asserting
something very useful anyway, IMO.  What I mean is that it isn't
critical to the code that follows that the ExprContext provided in the
PartitionPruneContext be the same as the parent PlanState's
ps_ExprContext, nor that the PlanState be available at all.

> 2) inconsistent spelling: run-time vs. runtime

I assume you meant in this comment:

* estate                       The EState for the query doing runtime pruning

Fixed by using run-time, which is a more commonly used term in the
source code than runtime.

> 3) PartitionPruneContext.is_valid - I think I'd rename the flag to
> "initialized" or something like that. The "is_valid" is a bit confusing,
> because it might seem the context can get invalidated later, but AFAICS
> that's not the case - we just initialize it lazily.

Agree that "initialized" is better, so renamed.

> 0003
> ----
>
> 1) In InitPlan I'd move
>
>     estate->es_part_prune_infos = plannedstmt->partPruneInfos;
>
>    before the comment, which is more about ExecDoInitialPruning.

Makes sense, done.

> 2) I'm not quite sure what "exec" partition pruning is?
>
> /*
>  * ExecInitPartitionPruning
>  *   Initialize the data structures needed for runtime "exec" partition
>  *   pruning and return the result of initial pruning, if available.
>
> Is that the same thing as "runtime pruning"?

"Exec" pruning refers to pruning performed during execution, using
PARAM_EXEC parameters. In contrast, "init" pruning occurs during plan
initialization, using parameters whose values remain constant during
execution, such as PARAM_EXTERN parameters and stable functions.

Before this patch, the ExecInitPartitionPruning function, called
during ExecutorStart(), performed "init" pruning and set up state in
the PartitionPruneState for subsequent "exec" pruning during
ExecutorRun(). With this patch, "init" pruning is performed well
before this function is called, leaving its sole responsibility to
setting up the state for "exec" pruning. It may be worth renaming the
function to better reflect this new role, rather than updating only
the comment.

Actually, that is what I decided to do in the attached, along with
some other adjustments like moving ExecDoInitialPruning() to
execPartition.c from execMain.c, fixing up some obsolete comments,
etc.

> 0004
> ----
>
> 1) typo: paraller/parallel

Oops, fixed.

> 2) What about adding an assert to ExecFindMatchingSubPlans, to check
> valisubplan_rtis is not NULL? It's just mentioned in a comment, but
> better to explicitly enforce that?

Good idea, done.

>
> 2) It may not be quite clear why ExecInitUpdateProjection() switches to
> mt_updateColnosLists. Should that be explained in a comment, somewhere?

There is a comment in the ModifyTableState struct definition:

    /*
     * List of valid updateColnosLists.  Contains only those belonging to
     * unpruned relations from ModifyTable.updateColnosLists.
     */
    List       *mt_updateColnosLists;

It seems redundant to reiterate this in ExecInitUpdateProjection().

> 3) unnecessary newline in ExecLookupResultRelByOid

Removed.

> 0005
> ----
>
> 1) auto_explain.c - So what happens if the plan gets invalidated? The
> hook explain_ExecutorStart returns early, but then what? Does that break
> the user session somehow, or what?

It will get called again after ExecutorStartExt() loops back to do
ExecutorStart() with a new updated plan tree.
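
Schematically, the loop in the proposed ExecutorStartExt() looks
something like this (a rough sketch; GetSingleCachedPlan() and its
arguments and return value are assumed from earlier in the thread, not
actual code):

    for (;;)
    {
        ExecutorStart(queryDesc, eflags);
        if (ExecPlanStillValid(queryDesc->estate))
            break;              /* deferred locks taken, plan still good */

        /* Locks taken after initial pruning invalidated the plan; build
         * a fresh plan for just this query and retry. */
        queryDesc->plannedstmt = GetSingleCachedPlan(plansource, query_index,
                                                     boundParams, queryEnv);
    }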

> 2) Isn't it a bit fragile if this requires every extension to update
> and add the ExecPlanStillValid() calls to various places?

The ExecPlanStillValid() call only needs to be added immediately after
the call to standard_ExecutorStart() in an extension's
ExecutorStart_hook() implementation.
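
In other words, an extension's hook ends up looking roughly like this
(a sketch under the proposed API; prev_ExecutorStart is the extension's
saved previous hook, and ExecPlanStillValid() is assumed to take the
EState):

    static void
    my_ExecutorStart(QueryDesc *queryDesc, int eflags)
    {
        if (prev_ExecutorStart)
            prev_ExecutorStart(queryDesc, eflags);
        else
            standard_ExecutorStart(queryDesc, eflags);

        /* If the deferred locks invalidated the plan, do nothing more;
         * the core executor will rebuild the plan and call us again. */
        if (!ExecPlanStillValid(queryDesc->estate))
            return;

        /* ... extension-specific setup that touches queryDesc->planstate ... */
    }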

> What if an
> extension doesn't do that? What weirdness will happen?

The QueryDesc.planstate won't contain a PlanState tree for starters
and other state information that InitPlan() populates in EState based
on the PlannedStmt.

> Maybe it'd be
> possible to at least check this in some other executor hook? Or at least
> we could ensure the check was done in assert-enabled builds? Or
> something to make extension authors aware of this?

I've added a note in the commit message, but if that's not enough, one
idea might be to change the return type of ExecutorStart_hook so that
the extensions that implement it are forced to be adjusted. Say, from
void to bool to indicate whether standard_ExecutorStart() succeeded
and thus created a "valid" plan.  I had that in the previous versions
of the patch.  Thoughts?
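
That would mean changing the hook type along these lines (the current
declaration is shown for contrast; the bool variant is only the idea
being floated here):

    /* as it is today, in executor.h */
    typedef void (*ExecutorStart_hook_type) (QueryDesc *queryDesc, int eflags);

    /* floated alternative: report whether a valid plan was set up */
    /* typedef bool (*ExecutorStart_hook_type) (QueryDesc *queryDesc, int eflags); */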

> Aside from going through the patches, I did a simple benchmark to see
> how this works in practice. I did a simple test, with pgbench -S and
> variable number of partitions/clients. I also varied the number of locks
> per transaction, because I was wondering if it may interact with the
> fast-path improvements. See the attached xeon.sh script and CSV with
> results from the 44/88-core machine.
>
> There's also two PDFs visualizing the results, to show the impact as a
> difference between "master" (no patches) vs. "pruning" build with v57
> applied. As usual, "green" is good (faster), read is "bad" (slower).
>
> For most combinations of parameters, there's no impact on throughput.
> Anything in 99-101% is just regular noise, possibly even more. I'm
> trying to reduce the noise a bit more, but this seems acceptable. I'd
> like to discuss three "cases" I see in the results:

Thanks for doing these benchmarks.  I'll reply separately to discuss
the individual cases.

> costing / auto mode
> -------------------
>
> Anyway, this leads me to a related question - not quite a "bug" in the
> patch, but something to perhaps think about. And that's costing, and
> what "auto" should do.
>
> There are two PNG charts, showing throughput for runs with -M prepared
> and 1000 partitions. Each chart shows throughput for the three cache
> modes, and different client counts. There's a clear distinction between
> "master" and "patched" runs - the "generic" plans performed terribly, by
> orders of magnitude. With the patches it beats the "custom" plans.
>
> Which is great! But it also means that while "auto" used to do the right
> thing, with the patches that's not the case.
>
> AFAIK that's because we don't consider the runtime pruning when costing
> the plans, so the cost is calculated as if no pruning happened. And so
> it seems way more expensive than it should ... and it loses with the
> custom scans. Is that correct, or do I understand this wrong?

That's correct. The planner does not consider runtime pruning when
assigning costs to Append or MergeAppend paths in
create_{merge}append_path().

> Just to be clear, I'm not claiming the patch has to deal with this. I
> suppose it can be handled as a future improvement, and I'm not even sure
> there's a good way to consider this during costing. For example, can we
> estimate how many partitions will be pruned?

There have been discussions about this in the 2017 development thread
of run-time pruning [1] and likely at some later point in other
threads.  One simple approach mentioned at [1] is to consider that
only 1 partition will be scanned for queries containing WHERE partkey
= $1, because only 1 partition can contain matching rows with that
condition.

I agree that this should be dealt with sooner than later so users get
generic plans even without having to use force_generic_plan.

I'll post the updated patches tomorrow.

--
Thanks, Amit Langote

[1] https://www.postgresql.org/message-id/CA%2BTgmoZv8sd9cKyYtHwmd_13%2BBAjkVKo%3DECe7G98tBK5Ejwatw%40mail.gmail.com



Re: generic plans and "initial" pruning

From
Tomas Vondra
Date:

On 12/4/24 14:34, Amit Langote wrote:
> Hi Tomas,
> 
> On Mon, Dec 2, 2024 at 3:36 AM Tomas Vondra <tomas@vondra.me> wrote:
>> Hi,
>>
>> I took a look at this patch, mostly to familiarize myself with the
>> pruning etc. I have a bunch of comments, but all of that is minor,
>> perhaps even nitpicking - with prior feedback from David, Tom and
>> Robert, I can't really compete with that.
> 
> Thanks for looking at this.  These are helpful.
> 
>> FWIW the patch needs a rebase, there's a minor bitrot - but it was
>> simply enough to fix for a review / testing.
>>
>>
>> 0001
>> ----
>>
>> 1) But if we don't expect this error to actually happen, do we really
>> need to make it ereport()? Maybe it should be plain elog(). I mean, it's
>> "can't happen" and thus doesn't need translations etc.
>>
>>     if (!bms_equal(relids, pruneinfo->relids))
>>         ereport(ERROR,
>>                 errcode(ERRCODE_INTERNAL_ERROR),
>>                 errmsg_internal("mismatching PartitionPruneInfo found at
>> part_prune_index %d",
>>                                 part_prune_index),
>>                 errdetail_internal("plan node relids %s, pruneinfo
>> relids %s",
>>                                    bmsToString(relids),
>>                                    bmsToString(pruneinfo->relids)));
> 
> I'm fine with elog() here even if it causes the message to be longer:
> 
> elog(ERROR, "mismatching PartitionPruneInfo found at part_prune_index
> %d (plan node relids %s, pruneinfo relids %s)
> 

I'm not forcing you to do elog, if you think ereport() is better. I'm
only asking because AFAIK the "policy" is that ereport is for cases that we
think can happen (and thus get translated), while elog(ERROR) is for
cases that we believe shouldn't happen.

So every time I see "ereport" I ask myself "how could this happen" which
doesn't seem to be the case here.

>> Perhaps it should even be an assert?
> 
> I am not sure about that.  Having a message handy might be good if a
> user ends up hitting this case for whatever reason, like trying to run
> a corrupted plan.
> 

I'm a bit skeptical about this, TBH. If we assume the plan is
"corrupted", why should we notice in this particular place? I mean, it
could be corrupted in a million different ways, and the chance that it
got through all the earlier steps is like 1 in a 1.000.000.

>> 2) unnecessary newline added to execPartition.h
> 
> Perhaps you meant "removed".  Fixed.
> 

Yes, sorry. I misread the diff.

>> 5) PlannerGlobal
>>
>>     /* List of PartitionPruneInfo contained in the plan */
>>     List       *partPruneInfos;
>>
>> Why does this say "contained in the plan" unlike the other fields? Is
>> there some sort of difference? I'm not saying it's wrong.
> 
> Ok, maybe the following is a bit more helpful and like the comment for
> other fields:
> 
>     /* "flat" list of PartitionPruneInfos */
>     List       *partPruneInfos;
> 

WFM

>> 0002
>> ----
>>
>> 1) Isn't it weird/undesirable partkey_datum_from_expr() loses some of
>> the asserts? Would the assert be incorrect in the new implementation, or
>> are we removing it simply because we happen to not have one of the fields?
> 
> The former -- the asserts would be incorrect in the new implementation
> -- because in the new implementation a standalone ExprContext is used
> that is independent of the parent PlanState (when available) for both
> types of runtime pruning.
> 
> The old asserts, particularly the second one, weren't asserting
> something very useful anyway, IMO.  What I mean is that it isn't
> critical to the code that follows that the ExprContext provided in the
> PartitionPruneContext be the same as the parent PlanState's
> ps_ExprContext, nor that the PlanState be available at all.
> 

OK, thanks for explaining

>> 2) inconsistent spelling: run-time vs. runtime
> 
> I assume you meant in this comment:
> 
> * estate                       The EState for the query doing runtime pruning
> 
> Fixed by using run-time, which is a more commonly used term in the
> source code than runtime.
> 

Not quite. I was looking at runtime/run-time in the patch files, but now
I realize some of that is preexisting ... Still, maybe the patch should
stick to one spelling.

>> 2) I'm not quite sure what "exec" partition pruning is?
>>
>> /*
>>  * ExecInitPartitionPruning
>>  *   Initialize the data structures needed for runtime "exec" partition
>>  *   pruning and return the result of initial pruning, if available.
>>
>> Is that the same thing as "runtime pruning"?
> 
> "Exec" pruning refers to pruning performed during execution, using
> PARAM_EXEC parameters. In contrast, "init" pruning occurs during plan
> initialization, using parameters whose values remain constant during
> execution, such as PARAM_EXTERN parameters and stable functions.
> 
> Before this patch, the ExecInitPartitionPruning function, called
> during ExecutorStart(), performed "init" pruning and set up state in
> the PartitionPruneState for subsequent "exec" pruning during
> ExecutorRun(). With this patch, "init" pruning is performed well
> before this function is called, leaving its sole responsibility to
> setting up the state for "exec" pruning. It may be worth renaming the
> function to better reflect this new role, rather than updating only
> the comment.
> 
> Actually, that is what I decided to do in the attached, along with
> some other adjustments like moving ExecDoInitialPruning() to
> execPartition.c from execMain.c, fixing up some obsolete comments,
> etc.
> 

I don't see any attachment :-(

Anyway, if I understand correctly, the "runtime pruning" has two
separate cases - initial pruning and exec pruning. Is that right?
> 
>>
>> 2) It may not be quite clear why ExecInitUpdateProjection() switches to
>> mt_updateColnosLists. Should that be explained in a comment, somewhere?
> 
> There is a comment in the ModifyTableState struct definition:
> 
>     /*
>      * List of valid updateColnosLists.  Contains only those belonging to
>      * unpruned relations from ModifyTable.updateColnosLists.
>      */
>     List       *mt_updateColnosLists;
> 
> It seems redundant to reiterate this in ExecInitUpdateProjection().
> 

Ah, I see. Makes sense.

> 
>> 0005
>> ----
>>
>> 1) auto_explain.c - So what happens if the plan gets invalidated? The
>> hook explain_ExecutorStart returns early, but then what? Does that break
>> the user session somehow, or what?
> 
> It will get called again after ExecutorStartExt() loops back to do
> ExecutorStart() with a new updated plan tree.
> 
>> 2) Isn't it a bit fragile if this requires every extension to update
>> and add the ExecPlanStillValid() calls to various places?
> 
> The ExecPlanStillValid() call only needs to be added immediately after
> the call to standard_ExecutorStart() in an extension's
> ExecutorStart_hook() implementation.
> 
>> What if an
>> extension doesn't do that? What weirdness will happen?
> 
> The QueryDesc.planstate won't contain a PlanState tree for starters
> and other state information that InitPlan() populates in EState based
> on the PlannedStmt.
> 

OK, and the consequence is that the query will fail, right?

>> Maybe it'd be
>> possible to at least check this in some other executor hook? Or at least
>> we could ensure the check was done in assert-enabled builds? Or
>> something to make extension authors aware of this?
> 
> I've added a note in the commit message, but if that's not enough, one
> idea might be to change the return type of ExecutorStart_hook so that
> the extensions that implement it are forced to be adjusted. Say, from
> void to bool to indicate whether standard_ExecutorStart() succeeded
> and thus created a "valid" plan.  I had that in the previous versions
> of the patch.  Thoughts?
> 

Maybe. My concern is that this case (plan getting invalidated) is fairly
rare, so it's entirely plausible the extension will seem to work just
fine without the code update for a long time.

Sure, changing the APIs is allowed, I'm just wondering if maybe there
might be a way to not have this issue, or at least notice the missing
call early.

I haven't tried, wouldn't it be better to modify ExecutorStart() to do
the retries internally? I mean, the extensions wouldn't need to check if
the plan is still valid, ExecutorStart() would take care of that. Yeah,
it might need some new arguments, but that's more obvious.

>> Aside from going through the patches, I did a simple benchmark to see
>> how this works in practice. I did a simple test, with pgbench -S and
>> variable number of partitions/clients. I also varied the number of locks
>> per transaction, because I was wondering if it may interact with the
>> fast-path improvements. See the attached xeon.sh script and CSV with
>> results from the 44/88-core machine.
>>
>> There's also two PDFs visualizing the results, to show the impact as a
>> difference between "master" (no patches) vs. "pruning" build with v57
>> applied. As usual, "green" is good (faster), read is "bad" (slower).
>>
>> For most combinations of parameters, there's no impact on throughput.
>> Anything in 99-101% is just regular noise, possibly even more. I'm
>> trying to reduce the noise a bit more, but this seems acceptable. I'd
>> like to discuss three "cases" I see in the results:
> 
> Thanks for doing these benchmarks.  I'll reply separately to discuss
> the individual cases.
> 
>> costing / auto mode
>> -------------------
>>
>> Anyway, this leads me to a related question - not quite a "bug" in the
>> patch, but something to perhaps think about. And that's costing, and
>> what "auto" should do.
>>
>> There are two PNG charts, showing throughput for runs with -M prepared
>> and 1000 partitions. Each chart shows throughput for the three cache
>> modes, and different client counts. There's a clear distinction between
>> "master" and "patched" runs - the "generic" plans performed terribly, by
>> orders of magnitude. With the patches it beats the "custom" plans.
>>
>> Which is great! But it also means that while "auto" used to do the right
>> thing, with the patches that's not the case.
>>
>> AFAIK that's because we don't consider the runtime pruning when costing
>> the plans, so the cost is calculated as if no pruning happened. And so
>> it seems way more expensive than it should ... and it loses with the
>> custom scans. Is that correct, or do I understand this wrong?
> 
> That's correct. The planner does not consider runtime pruning when
> assigning costs to Append or MergeAppend paths in
> create_{merge}append_path().
> 
>> Just to be clear, I'm not claiming the patch has to deal with this. I
>> suppose it can be handled as a future improvement, and I'm not even sure
>> there's a good way to consider this during costing. For example, can we
>> estimate how many partitions will be pruned?
> 
> There have been discussions about this in the 2017 development thread
> of run-time pruning [1] and likely at some later point in other
> threads.  One simple approach mentioned at [1] is to consider that
> only 1 partition will be scanned for queries containing WHERE partkey
> = $1, because only 1 partition can contain matching rows with that
> condition.
> 
> I agree that this should be dealt with sooner than later so users get
> generic plans even without having to use force_generic_plan.
> 
> I'll post the updated patches tomorrow.
> 

Cool, thanks!


regards
-- 
Tomas Vondra




Re: generic plans and "initial" pruning

From
Tom Lane
Date:
Tomas Vondra <tomas@vondra.me> writes:
> I'm not forcing you to do elog, if you think ereport() is better. I'm
> only asking because AFAIK the "policy" is that ereport is for cases that we
> think can happen (and thus get translated), while elog(ERROR) is for
> cases that we believe shouldn't happen.

The proposed coding looks fine from that perspective, because it uses
errmsg_internal and errdetail_internal which don't give rise to
translatable strings.  Having said that, if we think this is a
"can't happen" case then it's fair to wonder why go to such lengths
to format it prettily.  Also, I'd argue that the error message
style guidelines still apply, but this errdetail doesn't conform.

            regards, tom lane



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Dec 5, 2024 at 2:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tomas Vondra <tomas@vondra.me> writes:
> > I'm not forcing you to do elog, if you think ereport() is better. I'm
> > only asking because AFAIK the "policy" is that ereport is for cases that
> > think can happen (and thus get translated), while elog(ERROR) is for
> > cases that we believe shouldn't happen.
>
> The proposed coding looks fine from that perspective, because it uses
> errmsg_internal and errdetail_internal which don't give rise to
> translatable strings.  Having said that, if we think this is a
> "can't happen" case then it's fair to wonder why go to such lengths
> to format it prettily.  Also, I'd argue that the error message
> style guidelines still apply, but this errdetail doesn't conform.

Thinking about this further, perhaps an Assert is sufficient here. An
Append/MergeAppend node's part_prune_index not pointing to the correct
entry in the global "flat" list of PartitionPruneInfos would indicate
a bug. It seems unlikely that user actions could cause this issue.
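
Concretely, that would reduce the check to something like:

    Assert(bms_equal(relids, pruneinfo->relids));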

--
Thanks, Amit Langote



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote:
> On 12/4/24 14:34, Amit Langote wrote:
> > On Mon, Dec 2, 2024 at 3:36 AM Tomas Vondra <tomas@vondra.me> wrote:
> >> 0001
> >> ----
> >>
> >> 1) But if we don't expect this error to actually happen, do we really
> >> need to make it ereport()? Maybe it should be plain elog(). I mean, it's
> >> "can't happen" and thus doesn't need translations etc.
> >>
> >>     if (!bms_equal(relids, pruneinfo->relids))
> >>         ereport(ERROR,
> >>                 errcode(ERRCODE_INTERNAL_ERROR),
> >>                 errmsg_internal("mismatching PartitionPruneInfo found at
> >> part_prune_index %d",
> >>                                 part_prune_index),
> >>                 errdetail_internal("plan node relids %s, pruneinfo
> >> relids %s",
> >>                                    bmsToString(relids),
> >>                                    bmsToString(pruneinfo->relids)));
> >
> > I'm fine with elog() here even if it causes the message to be longer:
> >
> > elog(ERROR, "mismatching PartitionPruneInfo found at part_prune_index
> > %d (plan node relids %s, pruneinfo relids %s)
> >
>
> I'm not forcing you to do elog, if you think ereport() is better. I'm
> only asking because AFAIK the "policy" is that ereport is for cases that we
> think can happen (and thus get translated), while elog(ERROR) is for
> cases that we believe shouldn't happen.
>
> So every time I see "ereport" I ask myself "how could this happen" which
> doesn't seem to be the case here.
>
> >> Perhaps it should even be an assert?
> >
> > I am not sure about that.  Having a message handy might be good if a
> > user ends up hitting this case for whatever reason, like trying to run
> > a corrupted plan.
>
> I'm a bit skeptical about this, TBH. If we assume the plan is
> "corrupted", why should we notice in this particular place? I mean, it
> could be corrupted in a million different ways, and the chance that it
> got through all the earlier steps is like 1 in a 1.000.000.

Yeah, I am starting to think the same.  Btw, the idea to have a check
and elog() / ereport() came from Alvaro upthread:
https://www.postgresql.org/message-id/20221130181201.mfinyvtob3j5i2a6%40alvherre.pgsql

> >> 2) I'm not quite sure what "exec" partition pruning is?
> >>
> >> /*
> >>  * ExecInitPartitionPruning
> >>  *   Initialize the data structures needed for runtime "exec" partition
> >>  *   pruning and return the result of initial pruning, if available.
> >>
> >> Is that the same thing as "runtime pruning"?
> >
> > "Exec" pruning refers to pruning performed during execution, using
> > PARAM_EXEC parameters. In contrast, "init" pruning occurs during plan
> > initialization, using parameters whose values remain constant during
> > execution, such as PARAM_EXTERN parameters and stable functions.
> >
> > Before this patch, the ExecInitPartitionPruning function, called
> > during ExecutorStart(), performed "init" pruning and set up state in
> > the PartitionPruneState for subsequent "exec" pruning during
> > ExecutorRun(). With this patch, "init" pruning is performed well
> > before this function is called, leaving its sole responsibility to
> > setting up the state for "exec" pruning. It may be worth renaming the
> > function to better reflect this new role, rather than updating only
> > the comment.
> >
> > Actually, that is what I decided to do in the attached, along with
> > some other adjustments like moving ExecDoInitialPruning() to
> > execPartition.c from execMain.c, fixing up some obsolete comments,
> > etc.
> >
>
> I don't see any attachment :-(
>
> Anyway, if I understand correctly, the "runtime pruning" has two
> separate cases - initial pruning and exec pruning. Is that right?

That's correct.  These patches are about performing "initial" pruning
at a different time and place so that we can take the deferred locks
on the unpruned partitions before we perform ExecInitNode() on any of
the plan trees in the PlannedStmt.
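
Very roughly, the flow becomes something like the following (a loose
sketch; the function and field names come from the patches discussed in
this thread, and the real code differs in detail):

    int         rti = -1;

    /* Perform "initial" pruning up front, before any ExecInitNode(). */
    ExecDoInitialPruning(estate);

    /* Then lock only the relations that survived, i.e. the unpruned set. */
    while ((rti = bms_next_member(estate->es_unpruned_relids, rti)) >= 0)
    {
        RangeTblEntry *rte = exec_rt_fetch(rti, estate);

        if (rte->rtekind == RTE_RELATION)
            LockRelationOid(rte->relid, rte->rellockmode);
    }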

> >> 0005
> >> ----
> >>
> >> 1) auto_explain.c - So what happens if the plan gets invalidated? The
> >> hook explain_ExecutorStart returns early, but then what? Does that break
> >> the user session somehow, or what?
> >
> > It will get called again after ExecutorStartExt() loops back to do
> > ExecutorStart() with a new updated plan tree.
> >
> >> 2) Isn't it a bit fragile if this requires every extension to update
> >> and add the ExecPlanStillValid() calls to various places?
> >
> > The ExecPlanStillValid() call only needs to be added immediately after
> > the call to standard_ExecutorStart() in an extension's
> > ExecutorStart_hook() implementation.
> >
> >> What if an
> >> extension doesn't do that? What weirdness will happen?
> >
> > The QueryDesc.planstate won't contain a PlanState tree for starters
> > and other state information that InitPlan() populates in EState based
> > on the PlannedStmt.
>
> OK, and the consequence is that the query will fail, right?

No, the core executor will retry the execution with a new updated
plan.  In the absence of the early return, the extension might even
crash when accessing such incomplete QueryDesc.

What the patch makes the ExecutorStart_hook do is similar to how
InitPlan() will return early when locks taken on partitions that
survive initial pruning invalidate the plan.

> >> Maybe it'd be
> >> possible to at least check this in some other executor hook? Or at least
> >> we could ensure the check was done in assert-enabled builds? Or
> >> something to make extension authors aware of this?
> >
> > I've added a note in the commit message, but if that's not enough, one
> > idea might be to change the return type of ExecutorStart_hook so that
> > the extensions that implement it are forced to be adjusted. Say, from
> > void to bool to indicate whether standard_ExecutorStart() succeeded
> > and thus created a "valid" plan.  I had that in the previous versions
> > of the patch.  Thoughts?
>
> Maybe. My concern is that this case (plan getting invalidated) is fairly
> rare, so it's entirely plausible the extension will seem to work just
> fine without the code update for a long time.

You might see the errors like the one below when the core executor or
a hook tries to initialize or process in some other way a known
invalid plan, for example, because an unpruned partition's index got
concurrently dropped before the executor got the lock:

ERROR: could not open relation with OID xxx

> Sure, changing the APIs is allowed, I'm just wondering if maybe there
> might be a way to not have this issue, or at least notice the missing
> call early.
>
> I haven't tried, wouldn't it be better to modify ExecutorStart() to do
> the retries internally? I mean, the extensions wouldn't need to check if
> the plan is still valid, ExecutorStart() would take care of that. Yeah,
> it might need some new arguments, but that's more obvious.

One approach could be to move some code from standard_ExecutorStart()
into ExecutorStart(). Specifically, the code responsible for setting
up enough state in the EState to perform ExecDoInitialPruning(), which
takes locks that might invalidate the plan. If the plan does become
invalid, the hook and standard_ExecutorStart() are not called.
Instead, the caller, ExecutorStartExt() in this case, creates a new
plan.

This avoids the need to add ExecPlanStillValid() checks anywhere,
whether in core or extension code. However, it does mean accessing the
PlannedStmt earlier than InitPlan(), but the current placement of the
code is not exactly set in stone.
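
Roughly, ExecutorStart() would then be shaped like this (only a loose
sketch of the idea, not working code):

    void
    ExecutorStart(QueryDesc *queryDesc, int eflags)
    {
        /* Set up just enough executor state to run initial pruning and
         * take the deferred locks; this may find the plan invalid. */
        queryDesc->estate = CreateExecutorState();
        ExecDoInitialPruning(queryDesc->estate);

        if (!ExecPlanStillValid(queryDesc->estate))
            return;             /* ExecutorStartExt() will replan and retry */

        if (ExecutorStart_hook)
            (*ExecutorStart_hook) (queryDesc, eflags);
        else
            standard_ExecutorStart(queryDesc, eflags);
    }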

--
Thanks, Amit Langote



Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Dec 5, 2024 at 3:53 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote:
> > Sure, changing the APIs is allowed, I'm just wondering if maybe there
> > might be a way to not have this issue, or at least notice the missing
> > call early.
> >
> > I haven't tried, wouldn't it be better to modify ExecutorStart() to do
> > the retries internally? I mean, the extensions wouldn't need to check if
> > the plan is still valid, ExecutorStart() would take care of that. Yeah,
> > it might need some new arguments, but that's more obvious.
>
> One approach could be to move some code from standard_ExecutorStart()
> into ExecutorStart(). Specifically, the code responsible for setting
> up enough state in the EState to perform ExecDoInitialPruning(), which
> takes locks that might invalidate the plan. If the plan does become
> invalid, the hook and standard_ExecutorStart() are not called.
> Instead, the caller, ExecutorStartExt() in this case, creates a new
> plan.
>
> This avoids the need to add ExecPlanStillValid() checks anywhere,
> whether in core or extension code. However, it does mean accessing the
> PlannedStmt earlier than InitPlan(), but the current placement of the
> code is not exactly set in stone.

I tried this approach and found that it essentially disables testing
of this patch using the delay_execution module, which relies on the
ExecutorStart_hook(). The way the testing works is that the hook in
delay_execution.c pauses the execution of a cached plan to allow a
concurrent session to drop an index referenced in the plan. When
unpaused, execution initialization resumes by calling
standard_ExecutorStart(). At this point, obtaining the lock on the
partition whose index has been dropped invalidates the plan, which the
hook detects and reports. It then also reports the successful
re-execution of an updated plan that no longer references the dropped
index.  Hmm.

--
Thanks, Amit Langote



Re: generic plans and "initial" pruning

From
Tomas Vondra
Date:

On 12/5/24 07:53, Amit Langote wrote:
> On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote:
>> ...
>>
>>>> What if an
>>>> extension doesn't do that? What weirdness will happen?
>>>
>>> The QueryDesc.planstate won't contain a PlanState tree for starters
>>> and other state information that InitPlan() populates in EState based
>>> on the PlannedStmt.
>>
>> OK, and the consequence is that the query will fail, right?
> 
> No, the core executor will retry the execution with a new updated
> plan.  In the absence of the early return, the extension might even
> crash when accessing such incomplete QueryDesc.
> 
> What the patch makes the ExecutorStart_hook do is similar to how
> InitPlan() will return early when locks taken on partitions that
> survive initial pruning invalidate the plan.
> 

Isn't that what I said? My question was what happens if the extension
does not add the new ExecPlanStillValid() call - sorry if that wasn't
clear. If it can crash, that's what I meant by "fail".

>>>> Maybe it'd be
>>>> possible to at least check this in some other executor hook? Or at least
>>>> we could ensure the check was done in assert-enabled builds? Or
>>>> something to make extension authors aware of this?
>>>
>>> I've added a note in the commit message, but if that's not enough, one
>>> idea might be to change the return type of ExecutorStart_hook so that
>>> the extensions that implement it are forced to be adjusted. Say, from
>>> void to bool to indicate whether standard_ExecutorStart() succeeded
>>> and thus created a "valid" plan.  I had that in the previous versions
>>> of the patch.  Thoughts?
>>
>> Maybe. My concern is that this case (plan getting invalidated) is fairly
>> rare, so it's entirely plausible the extension will seem to work just
>> fine without the code update for a long time.
> 
> You might see the errors like the one below when the core executor or
> a hook tries to initialize or process in some other way a known
> invalid plan, for example, because an unpruned partition's index got
> concurrently dropped before the executor got the lock:
> 
> ERROR: could not open relation with OID xxx
> 

Yeah, but how likely is that? How often do plans get invalidated in a
regular application workload? People don't create or drop indexes very
often, for example ...

Again, I'm not saying requiring the call would be unacceptable; I'm sure
we made similar changes in the past. But if the requirement could be
avoided without too much contortion, that would be nice.


regards

-- 
Tomas Vondra




Re: generic plans and "initial" pruning

From
Tomas Vondra
Date:

On 12/5/24 12:28, Amit Langote wrote:
> On Thu, Dec 5, 2024 at 3:53 PM Amit Langote <amitlangote09@gmail.com> wrote:
>> On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote:
>>> Sure, changing the APIs is allowed, I'm just wondering if maybe there
>>> might be a way to not have this issue, or at least notice the missing
>>> call early.
>>>
>>> I haven't tried, wouldn't it be better to modify ExecutorStart() to do
>>> the retries internally? I mean, the extensions wouldn't need to check if
>>> the plan is still valid, ExecutorStart() would take care of that. Yeah,
>>> it might need some new arguments, but that's more obvious.
>>
>> One approach could be to move some code from standard_ExecutorStart()
>> into ExecutorStart(). Specifically, the code responsible for setting
>> up enough state in the EState to perform ExecDoInitialPruning(), which
>> takes locks that might invalidate the plan. If the plan does become
>> invalid, the hook and standard_ExecutorStart() are not called.
>> Instead, the caller, ExecutorStartExt() in this case, creates a new
>> plan.
>>
>> This avoids the need to add ExecPlanStillValid() checks anywhere,
>> whether in core or extension code. However, it does mean accessing the
>> PlannedStmt earlier than InitPlan(), but the current placement of the
>> code is not exactly set in stone.
> 
> I tried this approach and found that it essentially disables testing
> of this patch using the delay_execution module, which relies on the
> ExecutorStart_hook(). The way the testing works is that the hook in
> delay_execution.c pauses the execution of a cached plan to allow a
> concurrent session to drop an index referenced in the plan. When
> unpaused, execution initialization resumes by calling
> standard_ExecutorStart(). At this point, obtaining the lock on the
> partition whose index has been dropped invalidates the plan, which the
> hook detects and reports. It then also reports the successful
> re-execution of an updated plan that no longer references the dropped
> index.  Hmm.
> 

It's not clear to me why the change disables this testing, and I can't
try without a patch. Could you explain?


thanks

-- 
Tomas Vondra




Re: generic plans and "initial" pruning

From
Amit Langote
Date:
On Thu, Dec 5, 2024 at 10:53 PM Tomas Vondra <tomas@vondra.me> wrote:
> On 12/5/24 07:53, Amit Langote wrote:
> > On Thu, Dec 5, 2024 at 2:20 AM Tomas Vondra <tomas@vondra.me> wrote:
> >> ...
> >>
> >>>> What if an
> >>>> extension doesn't do that? What weirdness will happen?
> >>>
> >>> The QueryDesc.planstate won't contain a PlanState tree for starters
> >>> and other state information that InitPlan() populates in EState based
> >>> on the PlannedStmt.
> >>
> >> OK, and the consequence is that the query will fail, right?
> >
> > No, the core executor will retry the execution with a new updated
> > plan.  In the absence of the early return, the extension might even
> > crash when accessing such incomplete QueryDesc.
> >
> > What the patch makes the ExecutorStart_hook do is similar to how
> > InitPlan() will return early when locks taken on partitions that
> > survive initial pruning invalidate the plan.
>
> Isn't that what I said? My question was what happens if the extension
> does not add the new ExecPlanStillValid() call - sorry if that wasn't
> clear. If it can crash, that's what I meant by "fail".

Ok, I see.  So, I suppose you meant to confirm that the invalid plan
won't silently be executed and return wrong results.  Yes, I don't
think that would happen given the kinds of invalidations that are
possible.  The various checks in the ExecInitNode() path, such as the
one that catches a missing index, will prevent the plan from running.
I may not have searched exhaustively enough though.

> >>>> Maybe it'd be
> >>>> possible to at least check this in some other executor hook? Or at least
> >>>> we could ensure the check was done in assert-enabled builds? Or
> >>>> something to make extension authors aware of this?
> >>>
> >>> I've added a note in the commit message, but if that's not enough, one
> >>> idea might be to change the return type of ExecutorStart_hook so that
> >>> the extensions that implement it are forced to be adjusted. Say, from
> >>> void to bool to indicate whether standard_ExecutorStart() succeeded
> >>> and thus created a "valid" plan.  I had that in the previous versions
> >>> of the patch.  Thoughts?
> >>
> >> Maybe. My concern is that this case (plan getting invalidated) is fairly
> >> rare, so it's entirely plausible the extension will seem to work just
> >> fine without the code update for a long time.
> >
> > You might see the errors like the one below when the core executor or
> > a hook tries to initialize or process in some other way a known
> > invalid plan, for example, because an unpruned partition's index got
> > concurrently dropped before the executor got the lock:
> >
> > ERROR: could not open relation with OID xxx
>
> Yeah, but how likely is that? How often do plans get invalidated in a
> regular application workload? People don't create or drop indexes very
> often, for example ...

Yeah, that's a valid point.  Andres once mentioned that ANALYZE can
invalidate plans and that can occur frequently in busy systems.

> Again, I'm not saying requiring the call would be unacceptable; I'm sure
> we made similar changes in the past. But if the requirement could be
> avoided without too much contortion, that would be nice.

I tend to agree.

Another change introduced by the patch that extensions might need to
mind (noted in the commit message of v58-0004) is the addition of the
es_unpruned_relids field to EState. This field tracks the RT indexes
of relations that are locked and therefore safe to access during
execution. Importantly, it does not include the RT indexes of leaf
partitions that are pruned during "initial" pruning and thus remain
unlocked.

This change means that executor extensions can no longer assume that
all relations in the range table are locked and safe to access.
Instead, extensions must account for the possibility that some
relations, specifically pruned partitions, are not locked. Normally,
executor code accesses relations using ExecGetRangeTableRelation(),
which does not take a lock before returning the Relation pointer,
assuming that locks are already managed upstream.
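
So, under the patch, code outside the executor core that wants to touch
an arbitrary range-table relation would need a check along these lines
(a sketch; es_unpruned_relids is the new field described above, and the
ExecGetRangeTableRelation() signature is assumed unchanged):

    if (bms_is_member(rti, estate->es_unpruned_relids))
    {
        /* The relation is locked, so it is safe to open and inspect. */
        Relation    rel = ExecGetRangeTableRelation(estate, rti);

        /* ... use rel ... */
    }
    else
    {
        /* rti is a leaf partition removed by initial pruning; it is not
         * locked, so it must not be touched. */
    }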

--
Thanks, Amit Langote