Re: Stack-based tracking of per-node WAL/buffer usage - Mailing list pgsql-hackers
| From | Lukas Fittl |
|---|---|
| Subject | Re: Stack-based tracking of per-node WAL/buffer usage |
| Date | |
| Msg-id | CAP53PkxDupm5U6TV0LF_YWHQ2wfcSiLsgnDBcO6b5AchLJhp=A@mail.gmail.com |
| In response to | Re: Stack-based tracking of per-node WAL/buffer usage (Andres Freund <andres@anarazel.de>) |
| Responses | Re: Stack-based tracking of per-node WAL/buffer usage; Re: Stack-based tracking of per-node WAL/buffer usage; Re: Stack-based tracking of per-node WAL/buffer usage |
| List | pgsql-hackers |
Hi Andres,

Thanks for reviewing!

On Sat, Apr 4, 2026 at 12:39 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2026-04-04 02:43:50 -0700, Lukas Fittl wrote:
> > Attached v12, rebased, otherwise no changes.
> >
> > I realize time to freeze is getting close, and whilst I'd love to see
> > this go in, I'm also realistic - so I'll just do my best to support
> > review in the off chance we can make it for this release.
>
> I unfortunately think there's enough nontrivial design decisions - that I
> don't have a sufficiently confident assessment of - that I would be pretty
> hesitant to commit this at this stage of the cycle without some architectural
> review by senior folks. If Heikki did another round or two of review, it'd be
> a different story.

Ack - I'll keep it updated as needed in the next days to help, but I understand
that there are many competing priorities :)

FWIW, I do feel like Heikki has taken a look at the memory management aspects,
and Zsolt also had some good detailed feedback on lower level logic and OOM
behaviour - so I'm actually less worried about these now.

Heikki, your further review is very welcome, if you have the time. It'd also
be great if you could review the README.instrument (now in v13/0008) to see if
that makes sense to you.

> I think the high level design is a huge improvement and goes in the right
> direction, but some of the lower level stuff I'm far less confident about.

Glad to hear - I also think that the direction here is right, and I don't
think there is a lot of variance on architectural choices anymore.

> > On that note, I think 0001 and 0002 are independently useful
> > refactorings to split the different kinds of instrumentation that
> > should be ready to go, and I don't think should conflict much with
> > other patches in this commitfest.
>
> Yea, I'll see that those get committed.
>
> I could also see 0004 as potentially worth getting committed separately,
> although I'm a bit worried about test stability.
Yeah, I understand. FWIW, the current approach has been stable in CI runs, but
that doesn't mean something in the buildfarm won't complain.

> I recently looked at a coverage report, in the context of the index
> prefetching patch, and was a bit surprised that prominent parallel executor
> nodes have no coverage for EXPLAIN ANALYZE.
>
> parallel BHS is not covered:
> https://coverage.postgresql.org/src/backend/executor/nodeBitmapHeapscan.c.gcov.html#L536
>
> parallel IOS is not covered:
> https://coverage.postgresql.org/src/backend/executor/nodeIndexonlyscan.c.gcov.html#L430

Ugh. Good catch, that lack of coverage is not good.

I've added two tests for that, and guess what, I found a bug (unrelated to
stack-based instrumentation) - show_tidbitmap_info doesn't correctly
accumulate Heap Blocks information from the worker instrumentation - it only
ever shows that of the leader. For such auxiliary information in a parallel
context it needs to do the extra work, e.g. like show_indexsearches_info does.

I've put that bugfix and BHS coverage into its own commit (0004) before the
other tests, because I suspect we may even want to backpatch that. I've added
coverage of IOS via a check for Index Searches as a new patch as well (0005).
Let me know if you prefer if I start a new thread for those.

See attached v13, with feedback addressed, unless otherwise noted below. I've
also slightly simplified InstrAggNode, since I realized it is never called
when running=true.

> > From 90a7ed18f14c09c8a1299db3a015747fc6b6761c Mon Sep 17 00:00:00 2001
> > From: Lukas Fittl <lukas@fittl.com>
> > Date: Tue, 9 Sep 2025 02:16:59 -0700
> > Subject: [PATCH v12 5/9] Optimize measuring WAL/buffer usage through
> >  stack-based instrumentation
> >
> > Previously, in order to determine the buffer/WAL usage of a given code
> > section, we utilized continuously incrementing global counters that get
> > updated when the actual activity (e.g. shared block read) occurred, and
> > then calculated a diff when the code section ended. This resulted in a
> > bottleneck for executor node instrumentation specifically, with the
> > function BufferUsageAccumDiff showing up in profiles and in some cases
> > adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
> >
> > Instead, introduce a stack-based mechanism, where the actual activity
> > writes into the current stack entry. In the case of executor nodes, this
> > means that each node gets its own stack entry that is pushed at
> > InstrStartNode, and popped at InstrEndNode. Stack entries are zero
> > initialized (avoiding the diff mechanism) and get added to their parent
> > entry when they are finalized, i.e. no more modifications can occur.
> >
> > To correctly handle abort situations, any use of instrumentation stacks
> > must involve either a top-level QueryInstrumentation struct, and its
> > associated InstrQueryStart/InstrQueryStop helpers (which use resource
> > owners to handle aborts), or the Instrumentation struct itself with
> > dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
> > consistent state after an abort.
> >
> > This also drops the global pgBufferUsage; any callers interested in
> > measuring buffer activity should instead utilize InstrStart/InstrStop.
> >
> > The related global pgWalUsage is kept for now due to its use in pgstat
> > to track aggregate WAL activity and heap_page_prune_and_freeze for
> > measuring FPIs.
>
> Probably worth stating what the performance overhead of WAL and BUFFERS is
> after this patch?

I've added a note re: BUFFERS referencing the COUNT(*) numbers from earlier
in the thread - not sure if WAL is really worth talking about (it's already
quite a long commit message).
> > @@ -1015,19 +994,9 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
> >  	 */
> >  	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
> >  	{
> > -		/*
> > -		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
> > -		 * space is allocated in the per-query context so it will go away at
> > -		 * ExecutorEnd.
> > -		 */
> > +		/* Set up to track total elapsed time in ExecutorRun. */
> >  		if (queryDesc->totaltime == NULL)
> > -		{
> > -			MemoryContext oldcxt;
> > -
> > -			oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
> > -			queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
> > -			MemoryContextSwitchTo(oldcxt);
> > -		}
> > +			queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
> >  	}
> >  }
>
> Not at all the fault of this patch, but it does seem somewhat odd to me that
> we handle pgss/auto_explain wanting instrumentation by them updating the
> QueryDesc->totaltime, rather than having extensions add an eflag to ask
> standard_ExecutorStart to do so.

Agreed, that interaction is a bit odd. I think it'd be reasonable to do this
as a separate refactoring, especially now that I've stopped utilizing
totaltime as the parent for per-node instrumentation, per later notes.

> > @@ -2434,8 +2434,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
> >  	 * and PARALLEL_KEY_BUFFER_USAGE.
> >  	 *
> >  	 * If there are no extensions loaded that care, we could skip this.  We
> > -	 * have no way of knowing whether anyone's looking at pgWalUsage or
> > -	 * pgBufferUsage, so do it unconditionally.
> > +	 * have no way of knowing whether anyone's looking at instrumentation, so
> > +	 * do it unconditionally.
> >  	 */
> >  	shm_toc_estimate_chunk(&pcxt->estimator,
> >  						   mul_size(sizeof(WalUsage), pcxt->nworkers));
> > @@ -2887,6 +2887,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> >  	Relation	indexRel;
> >  	LOCKMODE	heapLockmode;
> >  	LOCKMODE	indexLockmode;
> > +	QueryInstrumentation *instr;
> >  	WalUsage   *walusage;
> >  	BufferUsage *bufferusage;
> >  	int			sortmem;
> > @@ -2936,7 +2937,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> >  	tuplesort_attach_shared(sharedsort, seg);
> >
> >  	/* Prepare to track buffer usage during parallel execution */
> > -	InstrStartParallelQuery();
> > +	instr = InstrStartParallelQuery();
> >
> >  	/*
> >  	 * Might as well use reliable figure when doling out maintenance_work_mem
> > @@ -2951,7 +2952,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> >  	/* Report WAL/buffer usage during parallel execution */
> >  	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
> >  	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
> > -	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> > +	InstrEndParallelQuery(instr,
> > +						  &bufferusage[ParallelWorkerNumber],
> >  						  &walusage[ParallelWorkerNumber]);
> >
> >  	index_close(indexRel, indexLockmode);
>
> Again not your fault, but it feels like the parallel index build
> infrastructure is all wrong. Reimplementing this stuff for every index type
> makes no sense.

Fully agreed, the duplication here is quite something. The 0006 patch cleans
this up a little bit by at least using the Instrumentation struct
consistently. That is not required for the stack-based commit, but would be
helpful if e.g. we were to put IO stats next to Buffer/WAL stats in the
future.
> > +For example, when Seq Scan A gets finalized in regular execution via
> > +ExecutorFinish, its instrumentation data gets added to the immediate
> > +parent in the execution tree, the NestLoop, which will then get added to
> > +Query A's QueryInstrumentation, which then accumulates to the parent.
> > +
> > +While we can typically think of this as a tree, the NodeInstrumentation
> > +underneath a particular QueryInstrumentation could behave differently --
> > +for example, it could propagate directly to the QueryInstrumentation, in
> > +order to not show cumulative numbers in EXPLAIN ANALYZE.
>
> Hm. This seems like a somewhat random example, why would one want this?

Hmm, yeah. I mainly included this because the fact that accumulation for the
individual nodes in the EXPLAIN happens to be in a tree-like structure is a
choice, no longer a requirement. It would be just as easy to only accumulate
to the parent QueryInstrumentation, and let explain.c present you a choice of
"BUFFERS SELF" / etc - we couldn't have done that previously.

I've left this in for now, but maybe it's better to drop it to avoid
confusion?

> > +If multiple QueryInstrumentations are active on the stack (e.g. nested
> > +portals), each one's abort handler uses InstrStopFinalize to unwind to
> > +whichever entry is higher up, so they compose correctly regardless of
> > +release order.
>
> Maybe "the abort handler of each uses InstrStopFinalize() to accumulate the
> statistics to its parent entry"?

I revised this as follows:

    If multiple QueryInstrumentations are active on the stack (e.g. nested
    portals), the abort handler of each uses InstrStopFinalize() to
    accumulate the statistics to the parent entry of either the entry being
    released, or a previously released entry if it was higher up in the
    stack, so they compose correctly regardless of release order.

i.e. it's worth noting that it's explicitly not the parent of the stack entry
being released, since that parent may already have been released itself
(since cleanup can be out of order).

> > +Memory Handling
> > +===============
> > +
> > +Instrumentation objects that use the stack must survive until
> > +finalization runs, including the abort case. To ensure this,
> > +QueryInstrumentation creates a dedicated "Instrumentation" MemoryContext
> > +(instr_cxt) as a child of TopMemoryContext. All child instrumentation
> > +(nodes, triggers) should be allocated in this context.
> > +
> > +On successful completion, instr_cxt is reparented to CurrentMemoryContext
> > +so its lifetime is tied to the caller's context. On abort, the
> > +ResourceOwner cleanup frees it after accumulating the instrumentation
> > +data to the current stack entry after resetting the stack.
>
> Makes sense.
>
> I mildly wonder if we should create one minimally sized "Instrumentations"
> node under TopMemoryContext, below which the "Instrumentation" contexts are
> created, instead of doing so directly under TopMemoryContext. But that's
> something that can easily be evolved later.

Sure, grouping the instrumentation memory contexts could make sense, though
the existing handling already serves the purpose of clearly showing when
there is an instrumentation leak. I'll not make this change for now to avoid
code churn, but as you note we could always adjust that part later.

> > @@ -247,9 +248,19 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
> >  	estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
> >  	estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
> >  	estate->es_top_eflags = eflags;
> > -	estate->es_instrument = queryDesc->instrument_options;
> >  	estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
> >
> > +	/*
> > +	 * Set up query-level instrumentation if needed.
> > +	 * We do this before
> > +	 * InitPlan so that node and trigger instrumentation can be allocated
> > +	 * within the query's dedicated instrumentation memory context.
> > +	 */
> > +	if (!queryDesc->totaltime && queryDesc->instrument_options)
> > +	{
> > +		queryDesc->totaltime = InstrQueryAlloc(queryDesc->instrument_options);
> > +		estate->es_instrument = queryDesc->totaltime;
> > +	}
> > +
> >  	/*
> >  	 * Set up an AFTER-trigger statement context, unless told not to, or
> >  	 * unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
>
> It seems pretty weird to still have queryDesc->totaltime *sometimes* created
> by pgss etc, but also create it in standard_ExecutorStart if not already
> created. What if the explain options aren't compatible? Sure
> pgss/auto_explain use ALL, but that's not a given.

Yeah, I think in practice all use cases I've ever seen pass INSTRUMENT_ALL
(and in fact it won't behave sanely if this differs between extensions), but
you're right there is no guarantee.

Overall, there are two aspects to this:

1) Query instrumentation as the parent for node instrumentation, driven by
   use of EXPLAIN or auto_explain setting queryDesc->instrument_options

2) Instrumentation as a mechanism to measure the activity of a query, as
   used by pg_stat_statements or auto_explain (to get the runtime /
   aggregate buffer usage)

I could see two solutions:

A) Keep two separate QueryInstrumentations (EXPLAIN/auto_explain get
   es_instrument, any extensions measuring aggregate activity get
   query->totaltime)

B) Have one internal QueryInstrumentation (that's responsible to be the
   abort "parent" to both node instrumentation, and query->totaltime)

I was initially thinking we could maybe combine them creatively (i.e. expand
on what we've done so far), but I'm not sure there is a reasonable design
that isn't convoluted. We could also have a way for extensions to "request"
a certain level of instrumentation (instead of directly allocating it), but
it seems the current hooks are insufficient for that.

I've gone with solution (A) for now, with es_instrument being allocated when
per-node instrumentation is needed. Obviously that gets us two ResOwner
cleanups instead of one when e.g. auto_explain is active, but I think that's
still acceptable. It also shows how it's easy to do an extra level of
nesting with the stack-based instrumentation, without too much expense.

With this in place, I do wonder if we should avoid the full memory context
setup in InstrQueryAlloc (i.e. instead just make a direct allocation),
unless we know that children are going to be attached. The downside of that
would be that we can't just re-assign the instr_cxt in
InstrQueryStopFinalize (we'd have to go back to the previous logic of doing
a memcpy into the caller's context, for the no-children case), but it might
make a notable performance difference?

> > +	/* Start up instrumentation for this execution run */
> >  	if (queryDesc->totaltime)
> > -		InstrStart(queryDesc->totaltime);
> > +	{
> > +		InstrQueryStart(queryDesc->totaltime);
> > +
> > +		/*
> > +		 * Remember all node entries for abort recovery. We do this once
> > +		 * here after InstrQueryStart has pushed the parent stack entry.
> > +		 */
> > +		if (estate->es_instrument &&
> > +			estate->es_instrument->instr.need_stack &&
> > +			!queryDesc->already_executed)
> > +			ExecRememberNodeInstrumentation(queryDesc->planstate,
> > +											queryDesc->totaltime);
> > +	}
>
> Hm. Was briefly worried about the overhead of
> ExecRememberNodeInstrumentation() in the context of cursors. But I see it's
> only done once.
>
> But why do we not just associate the NodeInstrumentation's with the
> QueryInstrumentation during the creation of the NodeInstrumentation?

That's a good point - if I recall correctly that was structured differently
in an earlier commit, hence the complexity. But this is no longer necessary,
and allows us to drop the ExecRememberNodeInstrumentation machinery. Nice :)

> > +	/*
> > +	 * Accumulate per-node and trigger statistics to their respective
> > +	 * parent instrumentation stacks.
> > +	 *
> > +	 * We skip this in parallel workers because their per-node stats are
> > +	 * reported individually via ExecParallelReportInstrumentation, and the
> > +	 * leader's own ExecFinalizeNodeInstrumentation handles propagation. If
> > +	 * we accumulated here, the leader would double-count: worker parent
> > +	 * nodes would already include their children's stats, and then the
> > +	 * leader's accumulation would add the children again.
> > +	 */
>
> Haven't looked into how this all works in sufficient detail, so I'm just
> asking you: This works correctly even when using EXPLAIN (ANALYZE, VERBOSE)
> showing per-worker "subtrees"?

Yeah, that's a good question, and you indeed found a bug - that was not
correctly accumulating up for the per-worker node information.

The main complexity here is the avoidance of double counting. I can think of
two very different approaches to solve this:

1) Have finalization only be responsible for accumulating into the overall
   query instrumentation (or whichever instrumentation is active), and not
   bother with adding per-node instrumentation to the parent node at all.
   Then, in explain.c, do the accumulation. If we ever wanted to invent a
   "BUFFERS SELF" type option (i.e. don't add them to the parent), that
   would be the way to go. It'd also make it easier to support accumulation
   for other types of statistics being added (e.g. "EXPLAIN (IO)").

2) Specifically walk the worker instrumentation after it has been retrieved
   (to avoid double counting), and add to each node's parents.

For now I've gone with (2) and added a dedicated
ExecFinalizeWorkerInstrumentation function to deal with this.
> > @@ -1284,8 +1325,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
> >  			palloc0_array(FmgrInfo, n);
> >  		resultRelInfo->ri_TrigWhenExprs = (ExprState **)
> >  			palloc0_array(ExprState *, n);
> > -		if (instrument_options)
> > -			resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
> > +		if (qinstr)
> > +			resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(qinstr, n);
>
> Hm. Why do we not need to pass down the instrument_options anymore? I guess
> the assumption is that we always are going to use the flags from qinstr?
>
> Is that right? Because right now pgss/auto_explain use _ALL, even when an
> EXPLAIN ANALYZE doesn't.

With the solution mentioned earlier, where es_instrument is a separate
allocation, this problem now goes away without any extra changes needed.

Overall, I think it's reasonable to make node/trigger instrumentation be
attached to a query instrumentation that has the instrumentation options set
that should be applied. That way we don't have to think about edge cases
like a query instrumentation that doesn't need a stack, but children that
do.

> > @@ -1081,14 +1081,28 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
> >  	instrument = GetInstrumentationArray(instrumentation);
> >  	instrument += i * instrumentation->num_workers;
> >  	for (n = 0; n < instrumentation->num_workers; ++n)
> > +	{
> >  		InstrAggNode(planstate->instrument, &instrument[n]);
> >
> > +		/*
> > +		 * Also add worker WAL usage to the global pgWalUsage counter.
> > +		 *
> > +		 * When per-node instrumentation is active, parallel workers skip
> > +		 * ExecFinalizeNodeInstrumentation (to avoid double-counting in
> > +		 * EXPLAIN), so per-node WAL activity is not rolled up into the
> > +		 * query-level stats that InstrAccumParallelQuery receives. Without
> > +		 * this, pgWalUsage would under-report WAL generated by parallel
> > +		 * workers when instrumentation is active.
> > +		 */
> > +		WalUsageAdd(&pgWalUsage, &instrument[n].instr.walusage);
> > +	}
>
> I'm not sure I understand why this doesn't also lead to double counting,
> given that InstrAccumParallelQuery() does also add the worker's usage to
> pgWalUsage?

That can be explained by the somewhat hard to follow difference between
InstrAccumParallelQuery and InstrAggNode, which is a pre-existing situation:

InstrAccumParallelQuery is used for accumulating the top level worker
instrumentation into the instrumentation that's active when
ExecParallelFinish runs. The active instrumentation at that point is either
the query's instrumentation (or if not used, instr_top), or the Gather node.

InstrAggNode is used for accumulating each worker's per-node instrumentation
into the leader's per-node instrumentation.

When per-node instrumentation is active (node->instrument is initialized),
the WalUsageAdd occurs in both InstrAccumParallelQuery and InstrAggNode -
but per the comment in standard_ExecutorFinish, we don't aggregate the
per-node instrumentation to the top level of the parallel worker - and
therefore InstrAccumParallelQuery would report basically no activity.

I tried to explain this in the comment above WalUsageAdd, but maybe this
needs further clarification?

> > +static bool
> > +ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
> > +{
> > +	Instrumentation *parent = (Instrumentation *) context;
> > +
> > +	Assert(parent != NULL);
> > +
> > +	if (node == NULL)
> > +		return false;
> > +
> > +	/*
> > +	 * Recurse into children first (bottom-up accumulation), passing our
> > +	 * instrumentation as the parent context. This ensures children can
> > +	 * accumulate to us even if they were never executed by the leader
> > +	 * (e.g. nodes beneath Gather that only workers ran).
> > +	 */
> > +	planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
> > +						  node->instrument ? &node->instrument->instr : parent);
>
> I don't think I understand that comment. What changes if the leader's node
> was never executed?

I think that was a timing issue in an earlier iteration, where the
stack-based instrumentation data was a separate allocation from the main
node instrumentation. Since that is no longer an issue, we can just require
node->instrument to be initialized here. Reworded the comment and added an
assert.

> > diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
> > index bc551f95a08..6892706a83a 100644
> > --- a/src/backend/executor/instrument.c
> > +++ b/src/backend/executor/instrument.c
> > @@ -16,30 +16,46 @@
> >  #include <unistd.h>
> >
> >  #include "executor/instrument.h"
> > +#include "utils/memutils.h"
> > +#include "utils/resowner.h"
> >
> > -BufferUsage pgBufferUsage;
> > -static BufferUsage save_pgBufferUsage;
> >  WalUsage	pgWalUsage;
>
> Why do we still need pgWalUsage if we have the same data in instr_stack.

Yeah. That is because of two reasons:

1) The questionable use of pgWalUsage to inform pruneheap.c whether an FPI
   occurred. I think using pgWalUsage for this is just wrong, it should use
   its own flag/counter. This can't use the top level instrumentation stack
   since it'd be updated too late (only on executor finish, not as writes
   are going on).

2) The use of pgWalUsage to update cumulative WAL usage statistics. We
   could adjust this by having separate "pgstat_count_wal_.." functions
   (mirroring how we deal with cumulative buffer usage statistics), or by
   pulling the information from the instrumentation stack and accepting
   that WAL statistics won't be refreshed whilst a query is executing
   (which is probably not okay? i.e. we might then have to invent some
   mechanism to periodically "flush" before the actual finalize).

Addressing (1) would be somewhat straightforward, so maybe the best way
forward is to do that, and then refactor this to use separate
"pgstat_count_wal_.." functions instead of keeping the pgWalUsage global.
I'll not do that here for now, since I don't think the double writing of
WAL stats is performance critical, and we'd still do that anyway when
having separate "pgstat_count_wal_.." functions.

> > +/*
> > + * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
> > + *
> > + * Note that this intentionally allows passing a stack that is not the current
> > + * top, as can happen with PG_FINALLY, or resource owners, which don't have a
> > + * guaranteed cleanup order.
> > + *
> > + * We are careful here to achieve two goals:
> > + *
> > + * 1) Reset the stack to the parent of whichever of the released stack entries
> > + *    has the lowest index
> > + * 2) Accumulate all instrumentation to the currently active instrumentation,
> > + *    so that callers get a complete picture of activity, even after an abort
> > + */
> > +void
> > +InstrStopFinalize(Instrumentation *instr)
> > +{
> > +	int			idx = -1;
> > +
> > +	for (int i = instr_stack.stack_size - 1; i >= 0; i--)
> > +	{
> > +		if (instr_stack.entries[i] == instr)
> > +		{
> > +			idx = i;
> > +			break;
> > +		}
> > +	}
>
> So this may not find a stack entry, because a prior call to
> InstrStopFinalize() already removed it from the stack, right?
>
> Makes it a bit more error prone. Maybe we should store whether the element
> is still on the stack in the Instrumentation, that way we a) can error out
> if we don't find it on the stack b) avoid searching the stack if already
> removed.

Yeah, that seems doable, added that as suggested. It does add an extra
instruction to InstrPushStack/InstrPopStack, but that's probably not
significant enough. We could always turn it into an assert-only check if
that's the case.

> > 	if (instr->need_timer)
> > +		InstrStopTimer(instr);
> > +
> > +	InstrAccumStack(instr_stack.current, instr);
> > +}
>
> Not that it's a huge issue, but seems like it'd be neater if the need_timer
> thing weren't duplicated, but implemented by calling InstrStop()?
That'd be a problem since InstrStop also pops the stack (if
need_stack=true), and InstrStopFinalize already popped the stack right
before. I think the only alternative here would be adding a flag on
InstrStop, but that seems worse to me.

> > +void
> > +InstrQueryStart(QueryInstrumentation *qinstr)
> > +{
> > +	InstrStart(&qinstr->instr);
> > +
> > +	if (qinstr->instr.need_stack)
> > +	{
> > +		Assert(CurrentResourceOwner != NULL);
> > +		qinstr->owner = CurrentResourceOwner;
> > +
> > +		ResourceOwnerEnlarge(qinstr->owner);
> > +		ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
> > +	}
> > +}
> > +
> > +void
> > +InstrQueryStop(QueryInstrumentation *qinstr)
> > +{
> > +	InstrStop(&qinstr->instr);
> > +
> > +	if (qinstr->instr.need_stack)
> > +	{
> > +		Assert(qinstr->owner != NULL);
> > +		ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
> > +		qinstr->owner = NULL;
> > +	}
> > +}
> > +
> > +void
> > +InstrQueryStopFinalize(QueryInstrumentation *qinstr)
> > +{
> > +	InstrStopFinalize(&qinstr->instr);
>
> Why are these Instr[Query]StopFinalize() rather than just
> Instr[Query]Finalize()?

If you're coming at this from a naming perspective: Mainly to make it clear
that these both stop the instrumentation, and finalize it. If we only
called it "Instr[Query]Finalize" it wouldn't be clear that there isn't a
missing "Stop" call.

Alternatively we could:

1) Require callers to do two separate function calls
2) Have a "finalize" argument to the Stop function. I had that in a prior
   iteration, but felt it was easier to miss the subtle true/false
   difference.

> > +/* start instrumentation during parallel executor startup */
> > +QueryInstrumentation *
> > +InstrStartParallelQuery(void)
> > +{
> > +	QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
> > +
> > +	InstrQueryStart(qinstr);
> > +	return qinstr;
> > +}
>
> Why do we hardcode INSTRUMENT_BUFFERS | INSTRUMENT_WAL?

That's reflecting the fact that parallel workers can only transport these
two instrumentation types. The 0006 patch removes that hardcoding. I'll add
a comment in the earlier patch for now, for clarity.

> > From 16e44d5508f91dd23da780901f3ec0126965628d Mon Sep 17 00:00:00 2001
> > From: Lukas Fittl <lukas@fittl.com>
> > Date: Sat, 7 Mar 2026 17:52:24 -0800
> > Subject: [PATCH v12 7/9] instrumentation: Optimize ExecProcNodeInstr
> >  instructions by inlining
> >
> > For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
> > ExecProcNodeInstr when starting/stopping instrumentation for that node.
> >
> > Previously each ExecProcNodeInstr would check which instrumentation
> > options are active in the InstrStartNode/InstrStopNode calls, and do the
> > corresponding work (timers, instrumentation stack, etc.). These
> > conditionals being checked for each tuple being emitted add up, and
> > cause a non-optimal set of instructions to be generated by the compiler.
> >
> > Because we already have an existing mechanism to specify a function
> > pointer when instrumentation is enabled, we can instead create
> > specialized functions that are tailored to the instrumentation options
> > enabled, and avoid conditionals on subsequent ExecProcNodeInstr calls.
> > This reduces the overhead for EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF)
> > for a stress test with a large COUNT(*) that does many ExecProcNode
> > calls from ~20% on top of actual runtime to ~3%. When using BUFFERS ON
> > the same query goes from ~20% to ~10% on top of actual runtime.
>
> I assume this is to a significant degree due to allowing for inlining.
> Have you checked how much of the effort you get by just putting
> ExecProcNodeInstr() into instrument.c?

Worth a try - I haven't tested that yet - I'll come back to this separately
and verify how much that buys us, vs spelling out the different variants.
> > > @@ -1014,7 +1016,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)'; > > then rejected by a recheck of the index condition. This happens because a > > GiST index is <quote>lossy</quote> for polygon containment tests: it actually > > returns the rows with polygons that overlap the target, and then we have > > - to do the exact containment test on those rows. > > + to do the exact containment test on those rows. The <literal>Table Buffers</literal> > > + counts indicate how many operations were performed on the table instead of > > + the index. This number is included in the <literal>Buffers</literal> counts. > > </para> > > > > <para> > > I wonder if listing "Index Buffers" separately, instead of "Table Buffers" > would make more sense, because normally the number of index accesses is much > smaller and therefore a bit easier to put into relation to "Buffers". > I don't think changing this to be focused on index buffers makes sense (but my opinion is weakly held). My arguments for why it doesn't make sense: 1) The primary activity of the node is the index (only) scan. The fact that it also does table access is what we're trying to call out, just like we're calling out heap fetches. 2) For index only scans the inverse of what you noted is true, i.e. you'd expect many more index buffers with very little table buffers. The fact that there were any table buffers at all is worth calling out. > > > @@ -165,11 +169,22 @@ IndexOnlyNext(IndexOnlyScanState *node) > > ItemPointerGetBlockNumber(tid), > > &node->ioss_VMBuffer)) > > { > > + bool found; > > + > > /* > > * Rats, we have to visit the heap to check visibility. 
> > */ > > InstrCountTuples2(node, 1); > > - if (!index_fetch_heap(scandesc, node->ioss_TableSlot)) > > + > > + if (table_instr) > > + InstrPushStack(table_instr); > > + > > + found = index_fetch_heap(scandesc, node->ioss_TableSlot); > > + > > + if (table_instr) > > + InstrPopStack(table_instr); > > + > > + if (!found) > > continue; /* no visible tuple, try next index entry */ > > > > ExecClearTuple(node->ioss_TableSlot); > > As-is this will unfortunately rather terribly conflict with the way the index > prefetching patch is restructuring things, as after it neither index nor > indexonly scan does the equivalent of index_fetch_heap() anymore. This all > goes through a tableam interface, which in turn will call to the index to get > the tids (to allow for tableam specific prefetching logic, obviously). > > I think this would require putting this into the IndexScanDesc via the > IndexScanInstrumentation etc. > > > Might be good for you to look at how that stuff works after the index > prefetching patch and comment if you see a problem. Agreed, I'll look at that tomorrow. Well, today, I suppose, looking at the clock.. Thanks, Lukas -- Lukas Fittl
Attachment
- v13-0002-instrumentation-Separate-per-node-logic-from-oth.patch
- v13-0003-instrumentation-Use-Instrumentation-instead-of-m.patch
- v13-0005-Parallel-Bitmap-Heap-Scan-Fix-EXPLAIN-reporting-.patch
- v13-0001-instrumentation-Separate-trigger-logic-from-othe.patch
- v13-0004-instrumentation-Replace-direct-changes-of-pgBuff.patch
- v13-0007-instrumentation-Add-additional-regression-tests-.patch
- v13-0006-Add-regression-test-coverage-for-EXPLAIN-of-Para.patch
- v13-0009-instrumentation-Use-Instrumentation-struct-for-p.patch
- v13-0008-Optimize-measuring-WAL-buffer-usage-through-stac.patch
- v13-0010-instrumentation-Optimize-ExecProcNodeInstr-instr.patch
- v13-0011-Index-scans-Show-table-buffer-accesses-separatel.patch
- v13-0012-Add-test_session_buffer_usage-test-module.patch