Re: generic plans and "initial" pruning - Mailing list pgsql-hackers

From Amit Langote
Subject Re: generic plans and "initial" pruning
Msg-id CA+HiwqFpEHBjosRackQhm6yKKnHgqm8Ewkn=qsctT1N0PqVSrg@mail.gmail.com
In response to Re: generic plans and "initial" pruning  (Amit Langote <amitlangote09@gmail.com>)
List pgsql-hackers
On Wed, Nov 12, 2025 at 11:17 PM Amit Langote <amitlangote09@gmail.com> wrote:
> The key idea is to avoid taking unnecessary locks when reusing a
> cached plan. To achieve that, we need to perform initial partition
> pruning during cached plan reuse in plancache.c so that only surviving
> partitions are locked. This requires some plumbing to reuse the result
> of this "early" pruning during executor startup, because repeating the
> pruning logic would be both inefficient and potentially inconsistent
> -- what if you get different results the second time? (I don't have
> proof that this can happen, but some earlier emails mention the
> theoretical risk, so better to be safe.)
>
> So this patch introduces ExecutorPrep(), which allows executor
> metadata such as initial pruning results (valid subplan indexes) and
> full unpruned_relids to be computed ahead of execution and reused
> later by ExecutorStart() and during QueryDesc setup in parallel
> workers using the results shared by the leader. The parallel query bit
> was discussed previously at [1], though I didn’t have a solution I
> liked then.
>
...
> The patch set is structured as follows:
>
> * Refactor partition pruning initialization (0001): separates the
> setup of the pruning state from its execution by introducing
> ExecCreatePartitionPruneStates(). This makes the pruning logic easier
> to reuse and adds flexibility to do only the setup but skip pruning in
> some cases.
>
> * Introduce ExecutorPrep infrastructure (0002): adds ExecutorPrep()
> and ExecPrep as a formal way to perform executor setup ahead of
> execution. This enables caching or transferring pruning results and
> other metadata without triggering execution. ExecutorStart() can now
> consume precomputed prep state from the EState created during
> ExecutorPrep().  ExecPrepCleanup() handles cleanup when the plan is
> invalidated during prep and so not executed; the state is cleaned up
> in the regular ExecutorEnd() path otherwise.

In the v1 patch, I had not made ExecutorStart() call ExecutorPrep() to
do the prep work (creating the EState, setting up es_relations, checking
permissions) when the QueryDesc did not carry the results of
ExecutorPrep() from an earlier stage. Instead, InitPlan() would detect
that prep was absent and perform the missing setup itself. On second
thought, it is cleaner for ExecutorStart() to detect the absence of prep
and call ExecutorPrep() directly, matching how the prep state would be
created when coming from plancache.c et al.

v2 changes the patch to do that.
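
Roughly, the v2 flow looks like the sketch below (illustrative only;
the QueryDesc->estate test and the ExecutorPrep() signature shown here
are placeholders based on the description above, not the patch's actual
code):

    void
    ExecutorStart(QueryDesc *queryDesc, int eflags)
    {
        /*
         * If no earlier stage (e.g. plancache.c) has already done the
         * prep work -- creating the EState, setting up es_relations,
         * checking permissions -- do it here, the same way it would be
         * done when coming from the plan cache.
         */
        if (queryDesc->estate == NULL)
            ExecutorPrep(queryDesc, eflags);

        /* ... existing startup work, consuming the prep state ... */
    }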

> * Enable pruning-aware locking in cached / generic plan reuse (0004):
> extends GetCachedPlan() and CheckCachedPlan() to call ExecutorPrep()
> on each PlannedStmt in the CachedPlan, locking only surviving
> partitions. Adds CachedPlanPrepData to pass this through plan cache
> APIs and down to execution via QueryDesc. Also reinstates the
> firstResultRel locking rule added in 28317de72 but later lost due to
> revert of the earlier pruning patch, to ensure correctness when all
> target partitions are pruned.

Looking at the changes to executor/function.c, I also noticed that I
had mistakenly allocated the ExecutorPrep state in
SQLFunctionCache.fcontext, whereas the correct context for
execution-related state is SQLFunctionCache.subcontext.  In the updated
patch, I've made postquel_start() reparent the prep EState's
es_query_cxt from fcontext to subcontext.  I also did not have a test
case that exercised cached plan reuse for SQL functions, so I added
one.  I split function.c's GetCachedPlan() + CachedPlanPrepData
plumbing into a new patch 0005 so it can be reviewed separately, since
it is the only non-mechanical call-site change.
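
The reparenting boils down to a MemoryContextSetParent() call in
postquel_start(), roughly like the sketch below (prep_estate is a
placeholder for however the prep EState reaches function.c):

    /*
     * Move the prep EState's per-query memory from under
     * fcache->fcontext to under fcache->subcontext, so it lives and
     * dies with the execution state rather than with the whole
     * SQL-function cache entry.
     */
    MemoryContextSetParent(prep_estate->es_query_cxt, fcache->subcontext);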

> Benchmark results:
>
> echo "plan_cache_mode = force_generic_plan" >> $PGDATA/postgresql.conf
> for p in 32 64 128 256 512 1024; do pgbench -i --partitions=$p >
> /dev/null 2>&1; echo -ne "$p\t"; pgbench -n -S -T10 -Mprepared | grep
> tps; done
>
> Master
>
> 32 tps = 23841.822407 (without initial connection time)
> 64 tps = 21578.619816 (without initial connection time)
> 128 tps = 18090.500707 (without initial connection time)
> 256 tps = 14152.248201 (without initial connection time)
> 512 tps = 9432.708423 (without initial connection time)
> 1024 tps = 5873.696475 (without initial connection time)
>
> Patched
>
> 32 tps = 24724.245798 (without initial connection time)
> 64 tps = 24858.206407 (without initial connection time)
> 128 tps = 24652.655269 (without initial connection time)
> 256 tps = 23656.756615 (without initial connection time)
> 512 tps = 22299.865769 (without initial connection time)
> 1024 tps = 21911.704317 (without initial connection time)

Re-ran the benchmark to include the 0-partition case and partition
counts above 1024:

echo "plan_cache_mode = force_generic_plan" >> $PGDATA/postgresql.conf
for p in 0 8 16 32 64 128 256 512 1024 2048 4096; do pgbench -i
--partitions=$p > /dev/null 2>&1; echo -ne "$p\t"; pgbench -n -S -T10
-Mprepared | grep tps; done

Master

0 tps = 23869.906654 (without initial connection time)
8 tps = 22682.498914 (without initial connection time)
16 tps = 22714.445711 (without initial connection time)
32 tps = 21653.589371 (without initial connection time)
64 tps = 20571.267545 (without initial connection time)
128 tps = 17138.088269 (without initial connection time)
256 tps = 13027.168426 (without initial connection time)
512 tps = 8689.486966 (without initial connection time)
1024 tps = 5450.525617 (without initial connection time)
2048 tps = 3034.383108 (without initial connection time)
4096 tps = 1560.110609 (without initial connection time)

Patched

0 tps = 23600.068719 (without initial connection time)
8 tps = 22548.439906 (without initial connection time)
16 tps = 22807.337363 (without initial connection time)
32 tps = 22837.789996 (without initial connection time)
64 tps = 22915.846820 (without initial connection time)
128 tps = 22958.472655 (without initial connection time)
256 tps = 22432.432730 (without initial connection time)
512 tps = 20327.618690 (without initial connection time)
1024 tps = 20554.932475 (without initial connection time)
2048 tps = 19947.061061 (without initial connection time)
4096 tps = 17294.369829 (without initial connection time)

In tabular format (a positive pct_change means patched is better):

 partitions    master        patched       pct_change
 ----------------------------------------------------
 0             23869.91      23600.07       -1.1%
 8             22682.50      22548.44       -0.6%
 16            22714.45      22807.34       +0.4%
 32            21653.59      22837.79       +5.5%
 64            20571.27      22915.85      +11.4%
 128           17138.09      22958.47      +34.0%
 256           13027.17      22432.43      +72.2%
 512            8689.49      20327.62     +133.9%
 1024           5450.53      20554.93     +277.1%
 2048           3034.38      19947.06     +557.4%
 4096           1560.11      17294.37    +1008.5%

I also did some runs with custom plans. The custom plan path should
behave about the same on master and patched, since the early
ExecutorPrep() work only applies to generic plan reuse.

echo "plan_cache_mode = force_custom_plan" >> $PGDATA/postgresql.conf
for p in 0 8 16 32 64 128 256 512 1024 2048 4096; do pgbench -i
--partitions=$p > /dev/null 2>&1; echo -ne "$p\t"; pgbench -n -S -T10
-Mprepared | grep tps; done

Master

0 tps = 22346.419557 (without initial connection time)
8 tps = 20959.115560 (without initial connection time)
16 tps = 21390.573290 (without initial connection time)
32 tps = 21358.292393 (without initial connection time)
64 tps = 21288.742635 (without initial connection time)
128 tps = 21167.721447 (without initial connection time)
256 tps = 21256.618661 (without initial connection time)
512 tps = 19401.261197 (without initial connection time)
1024 tps = 19169.135145 (without initial connection time)
2048 tps = 19504.102179 (without initial connection time)
4096 tps = 18880.855783 (without initial connection time)

Patched

0 tps = 22852.634752 (without initial connection time)
8 tps = 21596.432690 (without initial connection time)
16 tps = 21428.779996 (without initial connection time)
32 tps = 20629.225272 (without initial connection time)
64 tps = 21301.644733 (without initial connection time)
128 tps = 21098.543942 (without initial connection time)
256 tps = 21394.364662 (without initial connection time)
512 tps = 19475.152170 (without initial connection time)
1024 tps = 19585.768438 (without initial connection time)
2048 tps = 19810.211969 (without initial connection time)
4096 tps = 19160.981608 (without initial connection time)

In tabular format:

 partitions    master        patched       pct_change
 ----------------------------------------------------
 0             22346.42      22852.63      +2.3%
 8             20959.12      21596.43      +3.0%
 16            21390.57      21428.78      +0.2%
 32            21358.29      20629.23      -3.4%
 64            21288.74      21301.64      +0.1%
 128           21167.72      21098.54      -0.3%
 256           21256.62      21394.36      +0.6%
 512           19401.26      19475.15      +0.4%
 1024          19169.14      19585.77      +2.2%
 2048          19504.10      19810.21      +1.6%
 4096          18880.86      19160.98      +1.5%

The numbers are within noise range, as expected.

--
Thanks, Amit Langote
