Re: Stack-based tracking of per-node WAL/buffer usage - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Stack-based tracking of per-node WAL/buffer usage |
Date | |
Msg-id | mflexfbmbkmql5snr322ldu72dfnjr2dntxx3ippaz4tcso7vl@zmxxgiws3qux Whole thread Raw |
In response to | Re: Stack-based tracking of per-node WAL/buffer usage (Lukas Fittl <lukas@fittl.com>) |
List | pgsql-hackers |
On 2025-10-22 14:28:24 +0300, Lukas Fittl wrote: > On Tue, Sep 9, 2025 at 10:35 PM Lukas Fittl <lukas@fittl.com> wrote: > > > Attached an updated patch set that addresses the feedback, and also adds > > the complete removal of the global pgBufferUsage variable in later patches > > (0005-0007), to avoid counting both the stack and the variable. > > > > See attached the same patch set rebased on latest master. > From d40f69cce15dfa10479c8be31917b33a49d01477 Mon Sep 17 00:00:00 2001 > From: Lukas Fittl <lukas@fittl.com> > Date: Sun, 31 Aug 2025 16:37:05 -0700 > Subject: [PATCH v3 1/7] Instrumentation: Keep time fields as instrtime, > require caller to convert > > Previously the Instrumentation logic always converted to seconds, only for many > of the callers to do unnecessary division to get to milliseconds. Since an upcoming > refactoring will split the Instrumentation struct, utilize instrtime always to > keep things simpler. LGTM, think we should apply this regardless of the rest of the patches. > @@ -173,14 +169,14 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add) > dst->running = true; > dst->firsttuple = add->firsttuple; > } > - else if (dst->running && add->running && dst->firsttuple > add->firsttuple) > + else if (dst->running && add->running && INSTR_TIME_CMP_LT(dst->firsttuple, add->firsttuple)) > dst->firsttuple = add->firsttuple; This isn't due to this patch, but it seems a bit odd that we use the minimum time for the first tuple, but the average time for the node's completion... > diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h > index f71a851b18d..646934020d1 100644 > --- a/src/include/portability/instr_time.h > +++ b/src/include/portability/instr_time.h > @@ -184,6 +184,8 @@ GetTimerFrequency(void) > #define INSTR_TIME_ACCUM_DIFF(x,y,z) \ > ((x).ticks += (y).ticks - (z).ticks) > > +#define INSTR_TIME_CMP_LT(x,y) \ > + ((x).ticks > (y).ticks) > > #define INSTR_TIME_GET_DOUBLE(t) \ > ((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S) > -- > 2.47.1 Any reason to actually have _CMP_ in the name? Other operations like _ADD don't have such an additional verb in the name. > From 7546f855d138d0dac0d8c22ea5915314810f13e5 Mon Sep 17 00:00:00 2001 > From: Lukas Fittl <lukas@fittl.com> > Date: Sat, 1 Mar 2025 19:31:30 -0800 > Subject: [PATCH v3 2/7] Separate node instrumentation from other use of > Instrumentation struct > > Previously different places (e.g. query "total time") were repurposing > the Instrumentation struct initially introduced for capturing per-node > statistics during execution. This dual use of the struct is confusing, > e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated > code paths, and prevents future refactorings. > > Instead, simplify the Instrumentation struct to only track time, > WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop > outside of per-node instrumentation. Introduce the NodeInstrumentation > struct to carry forward the per-node instrumentation information. > @@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc) > */ > oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt); > > - /* > - * Make sure stats accumulation is done. (Note: it's okay if several > - * levels of hook all do this.) > - */ > - InstrEndLoop(queryDesc->totaltime); > - > /* Log plan if duration is exceeded. */ > msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total); > if (msec >= auto_explain_log_min_duration) Maybe add a comment about the removal of these InstrEndLoop() calls to the commit message? If I understand correctly they were superfluous before, but that's not entirely obvious when just looking at the patch. > +/* > + * General purpose instrumentation that can capture time, WAL/buffer usage and tuples > + * > + * Initialized through InstrAlloc, followed by one or more calls to a pair of > + * InstrStart/InstrStop (activity is measured inbetween). > + */ > typedef struct Instrumentation > +{ > + /* Parameters set at creation: */ > + bool need_timer; /* true if we need timer data */ > + bool need_bufusage; /* true if we need buffer usage data */ > + bool need_walusage; /* true if we need WAL usage data */ > + /* Internal state keeping: */ > + instr_time starttime; /* start time of last InstrStart */ > + BufferUsage bufusage_start; /* buffer usage at start */ > + WalUsage walusage_start; /* WAL usage at start */ > + /* Accumulated statistics: */ > + instr_time total; /* total runtime */ > + double ntuples; /* total tuples counted in InstrStop */ > + BufferUsage bufusage; /* total buffer usage */ > + WalUsage walusage; /* total WAL usage */ > +} Instrumentation; Maybe add a comment explaining why ntuples is in here? > From aa1acccb3dfa6a5d81a9a049d8cb63762a3d7cf7 Mon Sep 17 00:00:00 2001 > From: Lukas Fittl <lukas@fittl.com> > Date: Tue, 9 Sep 2025 02:16:59 -0700 > Subject: [PATCH v3 4/7] Introduce stack for tracking per-node WAL/buffer usage Could use a commit message :) > --- > .../pg_stat_statements/pg_stat_statements.c | 4 +- > src/backend/commands/explain.c | 8 +- > src/backend/commands/trigger.c | 4 +- > src/backend/executor/execMain.c | 25 ++- > src/backend/executor/execProcnode.c | 29 +++ > src/backend/executor/instrument.c | 199 ++++++++++++++---- > src/include/executor/executor.h | 1 + > src/include/executor/instrument.h | 53 ++++- > 8 files changed, 260 insertions(+), 63 deletions(-) > > diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c > index f43a33b3787..eeabd820d8e 100644 > --- a/contrib/pg_stat_statements/pg_stat_statements.c > +++ b/contrib/pg_stat_statements/pg_stat_statements.c > @@ -1089,8 +1089,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc) > PGSS_EXEC, > INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total), > queryDesc->estate->es_total_processed, > - &queryDesc->totaltime->bufusage, > - &queryDesc->totaltime->walusage, > + &INSTR_GET_BUFUSAGE(queryDesc->totaltime), > + &INSTR_GET_WALUSAGE(queryDesc->totaltime), Getting a pointer to something returned by a macro is a bit ugly... Perhaps it'd be better to just pass the &queryDesc->totaltime? But ugh, that's not easily possible given how pgss_planner() currently tracks things :( Maybe it's worth refactoring this a bit in a precursor patch? > @@ -1266,7 +1280,12 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo, > resultRelInfo->ri_TrigWhenExprs = (ExprState **) > palloc0(n * sizeof(ExprState *)); > if (instrument_options) > - resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options); > + { > + if ((instrument_options & INSTRUMENT_TIMER) != 0) > + resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_TIMER); > + else > + resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_ROWS); > + } > } > else > { I'd not duplicate the InstrAlloc(), but compute the flags separately. > /* ------------------------------------------------------------------------ > @@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void *context) > return false; > } > > +/* > + * ExecAccumNodeInstrumentation > + * > + * Accumulate instrumentation stats from all execution nodes to their respective > + * parents (or the original parent instrumentation stack). > + */ > +void > +ExecAccumNodeInstrumentation(PlanState *node) > +{ > + (void) ExecAccumNodeInstrumentation_walker(node, NULL); > +} I wonder if this is too narrow a name. There might be other uses of a pass across the node tree at that point. OTOH, it's probably better to just rename it at that later point. > +static bool > +ExecAccumNodeInstrumentation_walker(PlanState *node, void *context) > +{ > + if (node == NULL) > + return false; > + > + check_stack_depth(); > + > + planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context); There already is a check_stack_depth() in planstate_tree_walker(). > + if (node->instrument && node->instrument->stack.previous) > + InstrStackAdd(node->instrument->stack.previous, &node->instrument->stack); > + > + return false; > +} E.g. in ExecShutdownNode_walker we use planstate_tree_walker(), but then also have special handling for a few node types. Do we need something like that here too? It probably is ok, but it's worth explicitly checking and adding a comment. > +/* > + * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort. > + */ > +static void ResOwnerReleaseInstrumentation(Datum res); > +static const ResourceOwnerDesc instrumentation_resowner_desc = > +{ > + .name = "instrumentation", > + .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS, > + .release_priority = RELEASE_PRIO_FIRST, > + .ReleaseResource = ResOwnerReleaseInstrumentation, > + .DebugPrint = NULL, /* default message is fine */ > +}; Is there a reason to do the release here before the lock release? And why _FIRST? > +static void > +ResOwnerReleaseInstrumentation(Datum res) > +{ > + Instrumentation *instr = (Instrumentation *) DatumGetPointer(res); > + > + /* > + * Because registered resources are *not* called in reverse order, we'll > + * get what was first registered first at shutdown. Thus, on any later > + * resources we need to not change the stack, which was already set to the > + * correct previous entry. > + */ FWIW, the release order is not guaranteed to be in that order either, e.g. once resowner switches to hashing, it'll essentially be random. > + if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack)) > + pgInstrStack = instr->stack.previous; Hm - this is effectively O(stack-depth^2), right? It's probably fine, given that we have fairly limited nesting (explain + pg_stat_statements + auto_explain is probably the current max), but seems worth noting in a comment? > + /* > + * Always accumulate all collected stats before the abort, even if we > + * already walked up the stack with an earlier resource. > + */ > + if (pgInstrStack) > + InstrStackAdd(pgInstrStack, &instr->stack); Why are we accumulating stats in case of errors? It's probably fine, but doing less as part of cleanup is pre ferrable... > /* General purpose instrumentation handling */ > Instrumentation * > InstrAlloc(int n, int instrument_options) > { > - Instrumentation *instr; > + Instrumentation *instr = NULL; > + bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0; > + bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0; > + bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0; > + int i; > + > + /* > + * If resource owner will be used, we must allocate in the transaction > + * context (not the calling context, usually a lower context), because the > + * memory might otherwise be freed too early in an abort situation. > + */ > + if (need_buffers || need_wal) > + instr = MemoryContextAllocZero(CurTransactionContext, n * sizeof(Instrumentation)); > + else > + instr = palloc0(n * sizeof(Instrumentation)); Is this long-lived enough? I'm e.g. wondering about utility statements that internally starting transactions, wouldn't that cause problems with a user like pgss tracking something like CIC? > - /* initialize all fields to zeroes, then modify as needed */ > - instr = palloc0(n * sizeof(Instrumentation)); > - if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL)) > + for (i = 0; i < n; i++) > { > - bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0; > - bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0; > - bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0; > - int i; > - > - for (i = 0; i < n; i++) > - { > - instr[i].need_bufusage = need_buffers; > - instr[i].need_walusage = need_wal; > - instr[i].need_timer = need_timer; > - } > + instr[i].need_bufusage = need_buffers; > + instr[i].need_walusage = need_wal; > + instr[i].need_timer = need_timer; > } > > return instr; > } > + > void > InstrStart(Instrumentation *instr) > { > + Assert(!instr->finalized); > + > if (instr->need_timer && > !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime)) > elog(ERROR, "InstrStart called twice in a row"); > > - if (instr->need_bufusage) > - instr->bufusage_start = pgBufferUsage; > - > - if (instr->need_walusage) > - instr->walusage_start = pgWalUsage; > + if (instr->need_bufusage || instr->need_walusage) > + InstrPushStackResource(instr); > } > + > void > -InstrStop(Instrumentation *instr, double nTuples) > +InstrStop(Instrumentation *instr, double nTuples, bool finalize) > { > instr_time endtime; > > @@ -84,14 +178,15 @@ InstrStop(Instrumentation *instr, double nTuples) > INSTR_TIME_SET_ZERO(instr->starttime); > } > > - /* Add delta of buffer usage since entry to node's totals */ > - if (instr->need_bufusage) > - BufferUsageAccumDiff(&instr->bufusage, > - &pgBufferUsage, &instr->bufusage_start); > + if (instr->need_bufusage || instr->need_walusage) > + InstrPopStackResource(instr); > > - if (instr->need_walusage) > - WalUsageAccumDiff(&instr->walusage, > - &pgWalUsage, &instr->walusage_start); > + if (finalize) > + { > + instr->finalized = true; > + if (pgInstrStack) > + InstrStackAdd(pgInstrStack, &instr->stack); > + } > } > > /* Allocate new node instrumentation structure(s) */ > @@ -139,12 +234,14 @@ InstrStartNode(NodeInstrumentation * instr) > !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime)) > elog(ERROR, "InstrStartNode called twice in a row"); > > - /* save buffer usage totals at node entry, if needed */ > - if (instr->need_bufusage) > - instr->bufusage_start = pgBufferUsage; > + if (instr->need_bufusage || instr->need_walusage) > + { > + /* Ensure that we always have a parent, even at the top most node */ > + Assert(pgInstrStack != NULL); > > - if (instr->need_walusage) > - instr->walusage_start = pgWalUsage; > + instr->stack.previous = pgInstrStack; > + pgInstrStack = &instr->stack; > + } > } > > /* Exit from a plan node */ > @@ -169,14 +266,12 @@ InstrStopNode(NodeInstrumentation * instr, double nTuples) > INSTR_TIME_SET_ZERO(instr->starttime); > } > > - /* Add delta of buffer usage since entry to node's totals */ > - if (instr->need_bufusage) > - BufferUsageAccumDiff(&instr->bufusage, > - &pgBufferUsage, &instr->bufusage_start); > - > - if (instr->need_walusage) > - WalUsageAccumDiff(&instr->walusage, > - &pgWalUsage, &instr->walusage_start); > + if (instr->need_bufusage || instr->need_walusage) > + { > + /* Ensure that we always have a parent, even at the top most node */ > + Assert(instr->stack.previous != NULL); > + pgInstrStack = instr->stack.previous; > + } > > /* Is this the first tuple of this cycle? */ > if (!instr->running) > @@ -253,10 +348,20 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add) > > /* Add delta of buffer usage since entry to node's totals */ > if (dst->need_bufusage) > - BufferUsageAdd(&dst->bufusage, &add->bufusage); > + BufferUsageAdd(&dst->stack.bufusage, &add->stack.bufusage); > > if (dst->need_walusage) > - WalUsageAdd(&dst->walusage, &add->walusage); > + WalUsageAdd(&dst->stack.walusage, &add->stack.walusage); > +} > + > +void > +InstrStackAdd(InstrStack * dst, InstrStack * add) > +{ > + Assert(dst != NULL); > + Assert(add != NULL); > + > + BufferUsageAdd(&dst->bufusage, &add->bufusage); > + WalUsageAdd(&dst->walusage, &add->walusage); > } > Do we want to do BufferUsageAdd() etc even if we are not tracking buffer usage? Those operations aren't cheap... > /* note current values during parallel executor startup */ > @@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage) > void > InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage) > { > + if (pgInstrStack != NULL) > + { > + InstrStack *dst = pgInstrStack; > + > + BufferUsageAdd(&dst->bufusage, bufusage); > + WalUsageAdd(&dst->walusage, walusage); > + } > + > BufferUsageAdd(&pgBufferUsage, bufusage); > WalUsageAdd(&pgWalUsage, walusage); > } Is the pgInstrStack == NULL case actually reachable? > diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h > index 78d3653997b..d04607ce40c 100644 > --- a/src/include/executor/instrument.h > +++ b/src/include/executor/instrument.h > @@ -14,6 +14,7 @@ > #define INSTRUMENT_H > > #include "portability/instr_time.h" > +#include "utils/resowner.h" I'd probably not include resowner here but just forward declare the typedef. > @@ -146,26 +161,46 @@ extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage); > extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage); > extern void BufferUsageAccumDiff(BufferUsage *dst, > const BufferUsage *add, const BufferUsage *sub); > +extern void InstrStackAdd(InstrStack * dst, InstrStack * add); > extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add, > const WalUsage *sub); > > +#define INSTR_GET_BUFUSAGE(instr) \ > + instr->stack.bufusage > + > +#define INSTR_GET_WALUSAGE(instr) \ > + instr->stack.walusage Not convinced that having these macros is worthwhile. At this point I reached return -ENEEDCOFFEE :) Greetings, Andres Freund
pgsql-hackers by date: