Re: Stack-based tracking of per-node WAL/buffer usage - Mailing list pgsql-hackers
| From | Lukas Fittl |
|---|---|
| Subject | Re: Stack-based tracking of per-node WAL/buffer usage |
| Date | |
| Msg-id | CAP53PkxDupm5U6TV0LF_YWHQ2wfcSiLsgnDBcO6b5AchLJhp=A@mail.gmail.com |
| In response to | Re: Stack-based tracking of per-node WAL/buffer usage (Andres Freund <andres@anarazel.de>) |
| Responses | Re: Stack-based tracking of per-node WAL/buffer usage; Re: Stack-based tracking of per-node WAL/buffer usage; Re: Stack-based tracking of per-node WAL/buffer usage |
| List | pgsql-hackers |
Hi Andres,

Thanks for reviewing!

On Sat, Apr 4, 2026 at 12:39 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2026-04-04 02:43:50 -0700, Lukas Fittl wrote:
> > Attached v12, rebased, otherwise no changes.
> >
> > I realize time to freeze is getting close, and whilst I'd love to see
> > this go in, I'm also realistic - so I'll just do my best to support
> > review in the off chance we can make it for this release.
>
> I unfortunately think there's enough nontrivial design decisions - that I
> don't have a sufficiently confident assessment of - that I would be pretty
> hesitant to commit this at this stage of the cycle without some architectural
> review by senior folks. If Heikki did another round or two of review, it'd be
> a different story.

Ack - I'll keep it updated as needed in the next days to help, but I understand
that there are many competing priorities :)

FWIW, I do feel like Heikki has taken a look at the memory management aspects,
and Zsolt also had some good detailed feedback on lower level logic and OOM
behaviour - so I'm actually less worried about these now.

Heikki, your further review is very welcome, if you have the time. It'd also
be great if you could review the README.instrument (now in v13/0008) to see if
that makes sense to you.

> I think the high level design is a huge improvement and goes in the right
> direction, but some of the lower level stuff I'm far less confident about.

Glad to hear - I also think that the direction here is right, and I don't
think there is a lot of variance on architectural choices anymore.

> > On that note, I think 0001 and 0002 are independently useful
> > refactorings to split the different kinds of instrumentation that
> > should be ready to go, and I don't think should conflict much with
> > other patches in this commitfest.
>
> Yea, I'll see that those get committed.
>
> I could also see 0004 as potentially worth getting committed separately,
> although I'm a bit worried about test stability.
Yeah, I understand. FWIW, the current approach has been stable in CI runs, but
that doesn't mean something in the buildfarm won't complain.

> I recently looked at a coverage report, in the context of the index
> prefetching patch, and was a bit surprised that prominent parallel executor
> nodes have no coverage for EXPLAIN ANALYZE.
>
> parallel BHS is not covered:
> https://coverage.postgresql.org/src/backend/executor/nodeBitmapHeapscan.c.gcov.html#L536
>
> parallel IOS is not covered:
> https://coverage.postgresql.org/src/backend/executor/nodeIndexonlyscan.c.gcov.html#L430

Ugh. Good catch, that lack of coverage is not good.

I've added two tests for that, and guess what, I found a bug (unrelated to
stack-based instrumentation) - show_tidbitmap_info doesn't correctly
accumulate Heap Blocks information from the worker instrumentation - it only
ever shows that of the leader. For such auxiliary information in a parallel
context it needs to do the extra work, e.g. like show_indexsearches_info does.

I've put that bugfix and BHS coverage into its own commit (0004) before the
other tests, because I suspect we may even want to backpatch that. I've added
coverage of IOS via a check for Index Searches as a new patch as well (0005).
Let me know if you prefer if I start a new thread for those.

See attached v13, with feedback addressed, unless otherwise noted below. I've
also slightly simplified InstrAggNode, since I realized it is never called
when running=true.

> > From 90a7ed18f14c09c8a1299db3a015747fc6b6761c Mon Sep 17 00:00:00 2001
> > From: Lukas Fittl <lukas@fittl.com>
> > Date: Tue, 9 Sep 2025 02:16:59 -0700
> > Subject: [PATCH v12 5/9] Optimize measuring WAL/buffer usage through
> >  stack-based instrumentation
> >
> > Previously, in order to determine the buffer/WAL usage of a given code
> > section, we utilized continuously incrementing global counters that get
> > updated when the actual activity (e.g. shared block read) occurred, and
> > then calculated a diff when the code section ended. This resulted in a
> > bottleneck for executor node instrumentation specifically, with the
> > function BufferUsageAccumDiff showing up in profiles and in some cases
> > adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
> >
> > Instead, introduce a stack-based mechanism, where the actual activity
> > writes into the current stack entry. In the case of executor nodes, this
> > means that each node gets its own stack entry that is pushed at
> > InstrStartNode, and popped at InstrEndNode. Stack entries are zero
> > initialized (avoiding the diff mechanism) and get added to their parent
> > entry when they are finalized, i.e. no more modifications can occur.
> >
> > To correctly handle abort situations, any use of instrumentation stacks
> > must involve either a top-level QueryInstrumentation struct, and its
> > associated InstrQueryStart/InstrQueryStop helpers (which use resource
> > owners to handle aborts), or the Instrumentation struct itself with
> > dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
> > consistent state after an abort.
> >
> > This also drops the global pgBufferUsage; any callers interested in
> > measuring buffer activity should instead utilize InstrStart/InstrStop.
> >
> > The related global pgWalUsage is kept for now due to its use in pgstat
> > to track aggregate WAL activity and heap_page_prune_and_freeze for
> > measuring FPIs.
>
> Probably worth stating what the performance overhead of WAL and BUFFERS is
> after this patch?

I've added a note re: BUFFERS referencing the COUNT(*) numbers from earlier
in the thread - not sure if WAL is really worth talking about (it's already
quite a long commit message).
> > @@ -1015,19 +994,9 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
> >  	 */
> >  	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
> >  	{
> > -		/*
> > -		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
> > -		 * space is allocated in the per-query context so it will go away at
> > -		 * ExecutorEnd.
> > -		 */
> > +		/* Set up to track total elapsed time in ExecutorRun. */
> >  		if (queryDesc->totaltime == NULL)
> > -		{
> > -			MemoryContext oldcxt;
> > -
> > -			oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
> > -			queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
> > -			MemoryContextSwitchTo(oldcxt);
> > -		}
> > +			queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
> >  	}
> >  }
>
> Not at all the fault of this patch, but it does seem somewhat odd to me that
> we handle pgss/auto_explain wanting instrumentation by them updating the
> QueryDesc->totaltime, rather than having extensions add an eflag to ask
> standard_ExecutorStart to do so.

Agreed, that interaction is a bit odd. I think it'd be reasonable to do this
as a separate refactoring, especially now that I've stopped utilizing
totaltime as the parent for per-node instrumentation, per later notes.

> > @@ -2434,8 +2434,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
> >  	 * and PARALLEL_KEY_BUFFER_USAGE.
> >  	 *
> >  	 * If there are no extensions loaded that care, we could skip this.  We
> > -	 * have no way of knowing whether anyone's looking at pgWalUsage or
> > -	 * pgBufferUsage, so do it unconditionally.
> > +	 * have no way of knowing whether anyone's looking at instrumentation, so
> > +	 * do it unconditionally.
> >  	 */
> >  	shm_toc_estimate_chunk(&pcxt->estimator,
> >  						   mul_size(sizeof(WalUsage), pcxt->nworkers));
> > @@ -2887,6 +2887,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> >  	Relation	indexRel;
> >  	LOCKMODE	heapLockmode;
> >  	LOCKMODE	indexLockmode;
> > +	QueryInstrumentation *instr;
> >  	WalUsage   *walusage;
> >  	BufferUsage *bufferusage;
> >  	int			sortmem;
> > @@ -2936,7 +2937,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> >  	tuplesort_attach_shared(sharedsort, seg);
> >
> >  	/* Prepare to track buffer usage during parallel execution */
> > -	InstrStartParallelQuery();
> > +	instr = InstrStartParallelQuery();
> >
> >  	/*
> >  	 * Might as well use reliable figure when doling out maintenance_work_mem
> > @@ -2951,7 +2952,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> >  	/* Report WAL/buffer usage during parallel execution */
> >  	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
> >  	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
> > -	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> > +	InstrEndParallelQuery(instr,
> > +						  &bufferusage[ParallelWorkerNumber],
> >  						  &walusage[ParallelWorkerNumber]);
> >
> >  	index_close(indexRel, indexLockmode);
>
> Again not your fault, but it feels like the parallel index build
> infrastructure is all wrong. Reimplementing this stuff for every index type
> makes no sense.

Fully agreed, the duplication here is quite something. The 0006 patch cleans
this up a little bit by at least using the Instrumentation struct
consistently. That is not required for the stack-based commit, but would be
helpful if e.g. we were to put IO stats next to Buffer/WAL stats in the
future.
> > +For example, when Seq Scan A gets finalized in regular execution via
> > +ExecutorFinish, its instrumentation data gets added to the immediate
> > +parent in the execution tree, the NestLoop, which will then get added to
> > +Query A's QueryInstrumentation, which then accumulates to the parent.
> > +
> > +While we can typically think of this as a tree, the NodeInstrumentation
> > +underneath a particular QueryInstrumentation could behave differently --
> > +for example, it could propagate directly to the QueryInstrumentation, in
> > +order to not show cumulative numbers in EXPLAIN ANALYZE.
>
> Hm. This seems like a somewhat random example, why would one want this?

Hmm, yeah. I mainly included this because the fact that accumulation for the
individual nodes in the EXPLAIN happens to be in a tree-like structure is a
choice, no longer a requirement. It would be just as easy to only accumulate
to the parent QueryInstrumentation, and let explain.c present you a choice of
"BUFFERS SELF" / etc - we couldn't have done that previously.

I've left this in for now, but maybe it's better to drop it to avoid
confusion?

> > +If multiple QueryInstrumentations are active on the stack (e.g. nested
> > +portals), each one's abort handler uses InstrStopFinalize to unwind to
> > +whichever entry is higher up, so they compose correctly regardless of
> > +release order.
>
> Maybe "the abort handler of each uses InstrStopFinalize() to accumulate the
> statistics to its parent entry"?

I revised this as follows:

    If multiple QueryInstrumentations are active on the stack (e.g. nested
    portals), the abort handler of each uses InstrStopFinalize() to
    accumulate the statistics to the parent entry of either the entry being
    released, or a previously released entry if it was higher up in the
    stack, so they compose correctly regardless of release order.

i.e. it's worth noting that it's explicitly not the parent of the stack entry
being released, since that parent may already have been released itself
(since cleanup can be out of order).

> > +Memory Handling
> > +===============
> > +
> > +Instrumentation objects that use the stack must survive until
> > +finalization runs, including the abort case. To ensure this,
> > +QueryInstrumentation creates a dedicated "Instrumentation" MemoryContext
> > +(instr_cxt) as a child of TopMemoryContext. All child instrumentation
> > +(nodes, triggers) should be allocated in this context.
> > +
> > +On successful completion, instr_cxt is reparented to CurrentMemoryContext
> > +so its lifetime is tied to the caller's context. On abort, the
> > +ResourceOwner cleanup frees it after accumulating the instrumentation
> > +data to the current stack entry after resetting the stack.
>
> Makes sense.
>
> I mildly wonder if we should create one minimally sized "Instrumentations"
> node under TopMemoryContext, below which the "Instrumentation" contexts are
> created, instead of doing so directly under TopMemoryContext. But that's
> something that can easily be evolved later.

Sure, grouping the instrumentation memory contexts could make sense, though
the existing handling already serves the purpose of clearly showing when
there is an instrumentation leak. I'll not make this change for now to avoid
code churn, but as you note we could always adjust that part later.

> > @@ -247,9 +248,19 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
> >  	estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
> >  	estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
> >  	estate->es_top_eflags = eflags;
> > -	estate->es_instrument = queryDesc->instrument_options;
> >  	estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
> >
> > +	/*
> > +	 * Set up query-level instrumentation if needed.
> > +	 * We do this before
> > +	 * InitPlan so that node and trigger instrumentation can be allocated
> > +	 * within the query's dedicated instrumentation memory context.
> > +	 */
> > +	if (!queryDesc->totaltime && queryDesc->instrument_options)
> > +	{
> > +		queryDesc->totaltime = InstrQueryAlloc(queryDesc->instrument_options);
> > +		estate->es_instrument = queryDesc->totaltime;
> > +	}
> > +
> >  	/*
> >  	 * Set up an AFTER-trigger statement context, unless told not to, or
> >  	 * unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
>
> It seems pretty weird to still have queryDesc->totaltime *sometimes* created
> by pgss etc, but also create it in standard_ExecutorStart if not already
> created. What if the explain options aren't compatible? Sure
> pgss/auto_explain use ALL, but that's not a given.

Yeah, I think in practice all use cases I've ever seen pass INSTRUMENT_ALL
(and in fact it won't behave sanely if this differs between extensions), but
you're right there is no guarantee.

Overall, there are two aspects to this:

1) Query instrumentation as the parent for node instrumentation, driven by
   use of EXPLAIN or auto_explain setting queryDesc->instrument_options

2) Instrumentation as a mechanism to measure the activity of a query, as
   used by pg_stat_statements or auto_explain (to get the runtime /
   aggregate buffer usage)

I could see two solutions:

A) Keep two separate QueryInstrumentations (EXPLAIN/auto_explain get
   es_instrument, any extensions measuring aggregate activity get
   query->totaltime)

B) Have one internal QueryInstrumentation (that's responsible to be the
   abort "parent" to both node instrumentation, and query->totaltime)

I was initially thinking we could maybe combine them creatively (i.e. expand
on what we've done so far), but I'm not sure there is a reasonable design
that isn't convoluted. We could also have a way for extensions to "request"
a certain level of instrumentation (instead of directly allocating it), but
it seems the current hooks are insufficient for that.

I've gone with solution (A) for now, with es_instrument being allocated when
per-node instrumentation is needed. Obviously that gets us two ResOwner
cleanups instead of one when e.g. auto_explain is active, but I think that's
still acceptable. It also shows how it's easy to do an extra level of
nesting with the stack-based instrumentation, without too much expense.

With this in place, I do wonder if we should avoid the full memory context
setup in InstrQueryAlloc (i.e. instead just make a direct allocation),
unless we know that children are going to be attached. The downside of that
would be that we can't just re-assign the instr_cxt in
InstrQueryStopFinalize (we'd have to go back to the previous logic of doing
a memcpy into the caller's context, for the no-children case), but it might
make a notable performance difference?

> > +	/* Start up instrumentation for this execution run */
> >  	if (queryDesc->totaltime)
> > -		InstrStart(queryDesc->totaltime);
> > +	{
> > +		InstrQueryStart(queryDesc->totaltime);
> > +
> > +		/*
> > +		 * Remember all node entries for abort recovery. We do this once
> > +		 * here after InstrQueryStart has pushed the parent stack entry.
> > +		 */
> > +		if (estate->es_instrument &&
> > +			estate->es_instrument->instr.need_stack &&
> > +			!queryDesc->already_executed)
> > +			ExecRememberNodeInstrumentation(queryDesc->planstate,
> > +											queryDesc->totaltime);
> > +	}
>
> Hm. Was briefly worried about the overhead of
> ExecRememberNodeInstrumentation() in the context of cursors. But I see it's
> only done once.
>
> But why do we not just associate the NodeInstrumentation's with the
> QueryInstrumentation during the creation of the NodeInstrumentation?

That's a good point - if I recall correctly that was structured differently
in an earlier commit, hence the complexity. But this is no longer necessary,
and allows us to drop the ExecRememberNodeInstrumentation machinery. Nice :)

> > +	/*
> > +	 * Accumulate per-node and trigger statistics to their respective
> > +	 * parent instrumentation stacks.
> > +	 *
> > +	 * We skip this in parallel workers because their per-node stats are
> > +	 * reported individually via ExecParallelReportInstrumentation, and the
> > +	 * leader's own ExecFinalizeNodeInstrumentation handles propagation. If
> > +	 * we accumulated here, the leader would double-count: worker parent
> > +	 * nodes would already include their children's stats, and then the
> > +	 * leader's accumulation would add the children again.
> > +	 */
>
> Haven't looked into how this all works in sufficient detail, so I'm just
> asking you: This works correctly even when using EXPLAIN (ANALYZE, VERBOSE)
> showing per-worker "subtrees"?

Yeah, that's a good question, and you indeed found a bug - that was not
correctly accumulating up for the per-worker node information.

The main complexity here is the avoidance of double counting. I can think of
two very different approaches to solve this:

1) Have finalization only be responsible for accumulating into the overall
   query instrumentation (or whichever instrumentation is active), and not
   bother with adding per-node instrumentation to the parent node at all.
   Then, in explain.c, do the accumulation. If we ever wanted to invent a
   "BUFFERS SELF" type option (i.e. don't add them to the parent), that
   would be the way to go. It'd also make it easier to support accumulation
   for other types of statistics being added (e.g. "EXPLAIN (IO)").

2) Specifically walk the worker instrumentation after it has been retrieved
   (to avoid double counting), and add to each node's parents.

For now I've gone with (2) and added a dedicated
ExecFinalizeWorkerInstrumentation function to deal with this.
> > @@ -1284,8 +1325,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
> >  			palloc0_array(FmgrInfo, n);
> >  		resultRelInfo->ri_TrigWhenExprs = (ExprState **)
> >  			palloc0_array(ExprState *, n);
> > -		if (instrument_options)
> > -			resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
> > +		if (qinstr)
> > +			resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(qinstr, n);
>
> Hm. Why do we not need to pass down the instrument_options anymore? I guess
> the assumption is that we always are going to use the flags from qinstr?
>
> Is that right? Because right now pgss/auto_explain use _ALL, even when an
> EXPLAIN ANALYZE doesn't.

With the solution mentioned earlier, where es_instrument is a separate
allocation, this problem now goes away without any extra changes needed.

Overall, I think it's reasonable to make node/trigger instrumentation be
attached to a query instrumentation that has the instrumentation options set
that should be applied. That way we don't have to think about edge cases
like a query instrumentation that doesn't need a stack, but children that
do.

> > @@ -1081,14 +1081,28 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
> >  	instrument = GetInstrumentationArray(instrumentation);
> >  	instrument += i * instrumentation->num_workers;
> >  	for (n = 0; n < instrumentation->num_workers; ++n)
> > +	{
> >  		InstrAggNode(planstate->instrument, &instrument[n]);
> >
> > +		/*
> > +		 * Also add worker WAL usage to the global pgWalUsage counter.
> > +		 *
> > +		 * When per-node instrumentation is active, parallel workers skip
> > +		 * ExecFinalizeNodeInstrumentation (to avoid double-counting in
> > +		 * EXPLAIN), so per-node WAL activity is not rolled up into the
> > +		 * query-level stats that InstrAccumParallelQuery receives. Without
> > +		 * this, pgWalUsage would under-report WAL generated by parallel
> > +		 * workers when instrumentation is active.
> > +		 */
> > +		WalUsageAdd(&pgWalUsage, &instrument[n].instr.walusage);
> > +	}
>
> I'm not sure I understand why this doesn't also lead to double counting,
> given that InstrAccumParallelQuery() does also add the worker's usage to
> pgWalUsage?

That can be explained by the somewhat hard to follow difference between
InstrAccumParallelQuery and InstrAggNode, which is a pre-existing situation:

InstrAccumParallelQuery is used for accumulating the top level worker
instrumentation into the instrumentation that's active when
ExecParallelFinish runs. The active instrumentation at that point is either
the query's instrumentation (or if not used, instr_top), or the Gather node.

InstrAggNode is used for accumulating each worker's per-node instrumentation
into the leader's per-node instrumentation.

When per-node instrumentation is active (node->instrument is initialized),
the WalUsageAdd occurs in both InstrAccumParallelQuery and InstrAggNode -
but per the comment in standard_ExecutorFinish, we don't aggregate the
per-node instrumentation to the top level of the parallel worker - and
therefore InstrAccumParallelQuery would report basically no activity.

I tried to explain this in the comment above WalUsageAdd, but maybe this
needs further clarification?

> > +static bool
> > +ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
> > +{
> > +	Instrumentation *parent = (Instrumentation *) context;
> > +
> > +	Assert(parent != NULL);
> > +
> > +	if (node == NULL)
> > +		return false;
> > +
> > +	/*
> > +	 * Recurse into children first (bottom-up accumulation), passing our
> > +	 * instrumentation as the parent context. This ensures children can
> > +	 * accumulate to us even if they were never executed by the leader
> > +	 * (e.g. nodes beneath Gather that only workers ran).
> > +	 */
> > +	planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
> > +						  node->instrument ? &node->instrument->instr : parent);
>
> I don't think I understand that comment. What changes if the leader's node
> was never executed?

I think that was a timing issue in an earlier iteration, where the
stack-based instrumentation data was a separate allocation from the main
node instrumentation. Since that is no longer an issue, we can just require
node->instrument to be initialized here. Reworded the comment and added an
assert.

> > diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
> > index bc551f95a08..6892706a83a 100644
> > --- a/src/backend/executor/instrument.c
> > +++ b/src/backend/executor/instrument.c
> > @@ -16,30 +16,46 @@
> >  #include <unistd.h>
> >
> >  #include "executor/instrument.h"
> > +#include "utils/memutils.h"
> > +#include "utils/resowner.h"
> >
> > -BufferUsage pgBufferUsage;
> > -static BufferUsage save_pgBufferUsage;
> >  WalUsage	pgWalUsage;
>
> Why do we still need pgWalUsage if we have the same data in instr_stack.

Yeah. That is because of two reasons:

1) The questionable use of pgWalUsage to inform pruneheap.c whether an FPI
   occurred. I think using pgWalUsage for this is just wrong, it should use
   its own flag/counter. This can't use the top level instrumentation stack
   since it'd be updated too late (only on executor finish, not as writes
   are going on).

2) The use of pgWalUsage to update cumulative WAL usage statistics. We
   could adjust this by having separate "pgstat_count_wal_.." functions
   (mirroring how we deal with cumulative buffer usage statistics), or by
   pulling the information from the instrumentation stack and accepting
   that WAL statistics won't be refreshed whilst a query is executing
   (which is probably not okay? i.e. we might then have to invent some
   mechanism to periodically "flush" before the actual finalize).

Addressing (1) would be somewhat straightforward, so maybe the best way
forward is to do that, and then refactor this to use separate
"pgstat_count_wal_.." functions instead of keeping the pgWalUsage global.
I'll not do that here for now, since I don't think the double writing of
WAL stats is performance critical, and we'd still do that anyway when
having separate "pgstat_count_wal_.." functions.

> > +/*
> > + * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
> > + *
> > + * Note that this intentionally allows passing a stack that is not the current
> > + * top, as can happen with PG_FINALLY, or resource owners, which don't have a
> > + * guaranteed cleanup order.
> > + *
> > + * We are careful here to achieve two goals:
> > + *
> > + * 1) Reset the stack to the parent of whichever of the released stack entries
> > + *    has the lowest index
> > + * 2) Accumulate all instrumentation to the currently active instrumentation,
> > + *    so that callers get a complete picture of activity, even after an abort
> > + */
> > +void
> > +InstrStopFinalize(Instrumentation *instr)
> > +{
> > +	int			idx = -1;
> > +
> > +	for (int i = instr_stack.stack_size - 1; i >= 0; i--)
> > +	{
> > +		if (instr_stack.entries[i] == instr)
> > +		{
> > +			idx = i;
> > +			break;
> > +		}
> > +	}
>
> So this may not find a stack entry, because a prior call to
> InstrStopFinalize() already removed it from the stack, right?
>
> Makes it a bit more error prone. Maybe we should store whether the element
> is still on the stack in the Instrumentation, that way we a) can error out
> if we don't find it on the stack b) avoid searching the stack if already
> removed.

Yeah, that seems doable, added that as suggested. It does add an extra
instruction to InstrPushStack/InstrPopStack, but that's probably not
significant enough. We could always turn it into an assert-only check if
that's the case.

> > 	if (instr->need_timer)
> > +		InstrStopTimer(instr);
> > +
> > +	InstrAccumStack(instr_stack.current, instr);
> > +}
>
> Not that it's a huge issue, but seems like it'd be neater if the need_timer
> thing weren't duplicated, but implemented by calling InstrStop()?
That'd be a problem since InstrStop also pops the stack (if
need_stack=true), and InstrStopFinalize already popped the stack right
before. I think the only alternative here would be adding a flag on
InstrStop, but that seems worse to me.

> > +void
> > +InstrQueryStart(QueryInstrumentation *qinstr)
> > +{
> > +	InstrStart(&qinstr->instr);
> > +
> > +	if (qinstr->instr.need_stack)
> > +	{
> > +		Assert(CurrentResourceOwner != NULL);
> > +		qinstr->owner = CurrentResourceOwner;
> > +
> > +		ResourceOwnerEnlarge(qinstr->owner);
> > +		ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
> > +	}
> > +}
> > +
> > +void
> > +InstrQueryStop(QueryInstrumentation *qinstr)
> > +{
> > +	InstrStop(&qinstr->instr);
> > +
> > +	if (qinstr->instr.need_stack)
> > +	{
> > +		Assert(qinstr->owner != NULL);
> > +		ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
> > +		qinstr->owner = NULL;
> > +	}
> > +}
> > +
> > +void
> > +InstrQueryStopFinalize(QueryInstrumentation *qinstr)
> > +{
> > +	InstrStopFinalize(&qinstr->instr);
>
> Why are these Instr[Query]StopFinalize() rather than just
> Instr[Query]Finalize()?

If you're coming at this from a naming perspective: Mainly to make it clear
that these both stop the instrumentation, and finalize it. If we only
called it "Instr[Query]Finalize" it wouldn't be clear that there isn't a
missing "Stop" call.

Alternatively we could:

1) Require callers to do two separate function calls
2) Have a "finalize" argument to the Stop function. I had that in a prior
   iteration, but felt it was easier to miss the subtle true/false
   difference.

> > +/* start instrumentation during parallel executor startup */
> > +QueryInstrumentation *
> > +InstrStartParallelQuery(void)
> > +{
> > +	QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
> > +
> > +	InstrQueryStart(qinstr);
> > +	return qinstr;
> > +}
>
> Why do we hardcode INSTRUMENT_BUFFERS | INSTRUMENT_WAL?

That's reflecting the fact that parallel workers can only transport these
two instrumentation types. The 0006 patch removes that hardcoding. I'll add
a comment in the earlier patch for now, for clarity.

> > From 16e44d5508f91dd23da780901f3ec0126965628d Mon Sep 17 00:00:00 2001
> > From: Lukas Fittl <lukas@fittl.com>
> > Date: Sat, 7 Mar 2026 17:52:24 -0800
> > Subject: [PATCH v12 7/9] instrumentation: Optimize ExecProcNodeInstr
> >  instructions by inlining
> >
> > For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
> > ExecProcNodeInstr when starting/stopping instrumentation for that node.
> >
> > Previously each ExecProcNodeInstr would check which instrumentation
> > options are active in the InstrStartNode/InstrStopNode calls, and do the
> > corresponding work (timers, instrumentation stack, etc.). These
> > conditionals being checked for each tuple being emitted add up, and
> > cause a non-optimal set of instructions to be generated by the compiler.
> >
> > Because we already have an existing mechanism to specify a function
> > pointer when instrumentation is enabled, we can instead create
> > specialized functions that are tailored to the instrumentation options
> > enabled, and avoid conditionals on subsequent ExecProcNodeInstr calls.
> > This reduces the overhead for EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF)
> > for a stress test with a large COUNT(*) that does many ExecProcNode
> > calls from ~20% on top of actual runtime to ~3%. When using BUFFERS ON
> > the same query goes from ~20% to ~10% on top of actual runtime.
>
> I assume this is to a significant degree due to allowing for inlining.
> Have you checked how much of the effort you get by just putting
> ExecProcNodeInstr() into instrument.c?

Worth a try - I haven't tested that yet - I'll come back to this separately
and verify how much that buys us, vs spelling out the different variants.
> > > @@ -1014,7 +1016,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)'; > > then rejected by a recheck of the index condition. This happens because a > > GiST index is <quote>lossy</quote> for polygon containment tests: it actually > > returns the rows with polygons that overlap the target, and then we have > > - to do the exact containment test on those rows. > > + to do the exact containment test on those rows. The <literal>Table Buffers</literal> > > + counts indicate how many operations were performed on the table instead of > > + the index. This number is included in the <literal>Buffers</literal> counts. > > </para> > > > > <para> > > I wonder if listing "Index Buffers" separately, instead of "Table Buffers" > would make more sense, because normally the number of index accesses is much > smaller and therefore a bit easier to put into relation to "Buffers". > I don't think changing this to be focused on index buffers makes sense (but my opinion is weakly held). My arguments for why it doesn't make sense: 1) The primary activity of the node is the index (only) scan. The fact that it also does table access is what we're trying to call out, just like we're calling out heap fetches. 2) For index only scans the inverse of what you noted is true, i.e. you'd expect many more index buffers with very little table buffers. The fact that there were any table buffers at all is worth calling out. > > > @@ -165,11 +169,22 @@ IndexOnlyNext(IndexOnlyScanState *node) > > ItemPointerGetBlockNumber(tid), > > &node->ioss_VMBuffer)) > > { > > + bool found; > > + > > /* > > * Rats, we have to visit the heap to check visibility. 
> > */ > > InstrCountTuples2(node, 1); > > - if (!index_fetch_heap(scandesc, node->ioss_TableSlot)) > > + > > + if (table_instr) > > + InstrPushStack(table_instr); > > + > > + found = index_fetch_heap(scandesc, node->ioss_TableSlot); > > + > > + if (table_instr) > > + InstrPopStack(table_instr); > > + > > + if (!found) > > continue; /* no visible tuple, try next index entry */ > > > > ExecClearTuple(node->ioss_TableSlot); > > As-is this will unfortunately rather terribly conflict with the way the index > prefetching patch is restructuring things, as after it neither index nor > indexonly scan does the equivalent of index_fetch_heap() anymore. This all > goes through a tableam interface, which in turn will call to the index to get > the tids (to allow for tableam specific prefetching logic, obviously). > > I think this would require putting this into the IndexScanDesc via the > IndexScanInstrumentation etc. > > > Might be good for you to look at how that stuff works after the index > prefetching patch and comment if you see a problem. Agreed, I'll look at that tomorrow. Well, today, I suppose, looking at the clock.. Thanks, Lukas -- Lukas Fittl
Attachment
- v13-0002-instrumentation-Separate-per-node-logic-from-oth.patch
- v13-0003-instrumentation-Use-Instrumentation-instead-of-m.patch
- v13-0005-Parallel-Bitmap-Heap-Scan-Fix-EXPLAIN-reporting-.patch
- v13-0001-instrumentation-Separate-trigger-logic-from-othe.patch
- v13-0004-instrumentation-Replace-direct-changes-of-pgBuff.patch
- v13-0007-instrumentation-Add-additional-regression-tests-.patch
- v13-0006-Add-regression-test-coverage-for-EXPLAIN-of-Para.patch
- v13-0009-instrumentation-Use-Instrumentation-struct-for-p.patch
- v13-0008-Optimize-measuring-WAL-buffer-usage-through-stac.patch
- v13-0010-instrumentation-Optimize-ExecProcNodeInstr-instr.patch
- v13-0011-Index-scans-Show-table-buffer-accesses-separatel.patch
- v13-0012-Add-test_session_buffer_usage-test-module.patch